How to parse invoices accurately with bem

Invoices can be gnarly. Here's a production-ready, enterprise workflow to extract their data accurately (and enrich it with your source of truth).

Antonio Bustamante, Jan 5, 2026

If you are building an accounts payable automation or a spend management platform, you know the reality of invoices: they are chaotic. They arrive as PDFs, images, or forwarded emails. They have infinite layouts. Tables span multiple pages. Line items are messy.

Most developers start by throwing these files at a generic LLM with a prompt like "Extract the invoice number and line items."

This works for a hackathon demo. It fails in production.

In finance, "mostly accurate" is a bug. You cannot have a 95% success rate when processing millions of dollars in payments. You need deterministic outputs from probabilistic inputs.

At bem, we don't do "vibe coding." We build atomic units of functionality that enforce strict contracts between your messy data and your database.

Here is how to engineer a production-ready invoice pipeline that doesn't just extract text, but maps it to your internal source of truth.

1. The Contract: Defining the Transform Function

We don't send a chat message to a model asking it to "be helpful." We define a Transform Function. This is a strict schema definition that forces the underlying inference engine to adhere to a specific JSON structure.

For an invoice, we care about specific fields: Vendor Name, Invoice Number, Total Amount, and the Line Items table.

```typescript
// POST /v2/functions
// Creating the "Invoice Extractor" Transform Function

const invoiceSchema = {
  type: "object",
  properties: {
    vendor_name: { type: "string", description: "The name of the vendor issuing the invoice" },
    invoice_number: { type: "string", description: "Unique identifier for the invoice" },
    total_amount: { type: "number", description: "The final total including tax" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" },
          total: { type: "number" }
        },
        required: ["description", "total"]
      }
    }
  },
  required: ["vendor_name", "total_amount", "line_items"]
};

await fetch("https://api.bem.ai/v2/functions", {
  method: "POST",
  headers: {
    // The body below is JSON, so declare it as such
    "Content-Type": "application/json",
    // Get an API key by signing up at app.bem.ai
    "x-api-key": process.env.BEM_API_KEY ?? ""
  },
  body: JSON.stringify({
    functionName: "invoice-extractor-v1",
    type: "transform",
    outputSchemaName: "Invoice Schema",
    outputSchema: invoiceSchema
  })
});
```
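A schema guarantees the shape of the output, not its arithmetic. Before the data touches your ledger, it can be worth running a cheap consistency check on the line items. What follows is a minimal sketch under stated assumptions, not part of the bem API: the `Invoice` and `LineItem` types simply mirror the schema above, while `findLineItemMismatches` and its one-cent tolerance are hypothetical choices made for this example.

```typescript
// Types mirroring the invoice schema; fields the schema does not require are optional.
interface LineItem {
  description: string;
  quantity?: number;
  unit_price?: number;
  total: number;
}

interface Invoice {
  vendor_name: string;
  invoice_number?: string;
  total_amount: number;
  line_items: LineItem[];
}

// Hypothetical helper: returns the indexes of line items where
// quantity * unit_price differs from the stated total by more than `tolerance`.
// Items missing quantity or unit_price are skipped, since there is nothing to check.
function findLineItemMismatches(invoice: Invoice, tolerance = 0.01): number[] {
  const mismatches: number[] = [];
  invoice.line_items.forEach((item, index) => {
    if (item.quantity === undefined || item.unit_price === undefined) return;
    const expected = item.quantity * item.unit_price;
    if (Math.abs(expected - item.total) > tolerance) mismatches.push(index);
  });
  return mismatches;
}

// Usage: the second line item claims 69.97, but 3 * 19.99 is 59.97, so index 1 is flagged.
const flagged = findLineItemMismatches({
  vendor_name: "Amazon Web Services",
  total_amount: 176.97,
  line_items: [
    { description: "Compute", quantity: 2, unit_price: 50, total: 100 },
    { description: "Storage", quantity: 3, unit_price: 19.99, total: 69.97 },
    { description: "Support", total: 7 }
  ]
});
console.log(flagged);
```

A flagged index is a signal to route the invoice to review, not proof that the extraction was wrong; the source document itself may contain the discrepancy.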

2. The Source of Truth: Collections & Enrichment

Extracting "Amazon Web Services" from a PDF is easy. Knowing that "Amazon Web Services", "AWS", and "Amazon.com" all map to Vendor ID vnd_9982 in your ERP is hard.

Standard LLMs hallucinate these mappings. bem uses Collections and Enrich Functions to ground AI outputs in your actual data.

First, we upload your vendor master list to a Collection. This is a vector-embedded database managed by bem.

```typescript
// 1. Initialize the Collection
// POST /v2/collections

await fetch("https://api.bem.ai/v2/collections", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.BEM_API_KEY ?? ""
  },
  body: JSON.stringify({
    collectionName: "vendor-master-list"
  })
});

// 2. Upload your Vendor Master List
// POST /v2/collections/items

await fetch("https://api.bem.ai/v2/collections/items", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.BEM_API_KEY ?? ""
  },
  body: JSON.stringify({
    collectionName: "vendor-master-list",
    items: [
      { data: { id: "vnd_9982", name: "Amazon Web Services", category: "Cloud Infrastructure" } },
      { data: { id: "vnd_5521", name: "WeWork", category: "Office Rent" } },
      // ...
    ]
  })
});
```
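A real vendor master list has thousands of rows, not two, so you will likely want to build the upload bodies from your own table and send them in batches. The sketch below is one way to do that, under stated assumptions: `VendorRow`, `toCollectionItems`, `chunk`, and `buildUploadBodies` are hypothetical names, and the default batch size of 500 is an arbitrary illustration, not a documented bem limit. Only the `{ collectionName, items: [{ data }] }` envelope comes from the upload request above.

```typescript
// Shape of a row in your own vendor table (hypothetical; adjust to your ERP export).
interface VendorRow {
  id: string;
  name: string;
  category: string;
}

// Wraps each row in the { data: ... } envelope used by the upload request.
function toCollectionItems(rows: VendorRow[]): { data: VendorRow }[] {
  return rows.map((row) => ({ data: row }));
}

// Splits an array into batches of at most `size` elements.
function chunk<T>(values: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < values.length; i += size) {
    batches.push(values.slice(i, i + size));
  }
  return batches;
}

// Builds one JSON request body per batch.
// The default of 500 is an assumption for illustration, not a documented limit.
function buildUploadBodies(rows: VendorRow[], collectionName: string, batchSize = 500): string[] {
  return chunk(toCollectionItems(rows), batchSize).map((items) =>
    JSON.stringify({ collectionName, items })
  );
}
```

Each string returned by `buildUploadBodies` can be passed as the `body` of a separate POST to `/v2/collections/items`.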

Next, we create an Enrich Function. This function takes the extracted vendor name from step 1, performs a semantic search against your Collection, and appends the correct Vendor ID to the payload.

This isn't a guess; it's a retrieval-augmented lookup configured strictly via JMESPath selectors.

```typescript
// POST /v2/functions
// Creating the "Vendor Matcher" Enrich Function

await fetch("https://api.bem.ai/v2/functions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.BEM_API_KEY ?? ""
  },
  body: JSON.stringify({
    functionName: "vendor-enricher-v1",
    type: "enrich",
    config: {
      steps: [
        {
          // Take the extracted name from the previous function
          sourceField: "vendor_name",
          // Search against your master list
          collectionName: "vendor-master-list",
          // Inject the result into a new field
          targetField: "matched_vendor_record",
          // Use hybrid search for best results (keyword + semantic)
          searchMode: "hybrid",
          topK: 1
        }
      ]
    }
  })
});
```

3. The Orchestration: The Workflow

Now we chain them together. In bem, a Workflow is a Directed Acyclic Graph (DAG) of functions. The output of the invoice-extractor becomes the input of the vendor-enricher.

```typescript
// POST /v2/workflows
// Linking extraction and enrichment

await fetch("https://api.bem.ai/v2/workflows", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.BEM_API_KEY ?? ""
  },
  body: JSON.stringify({
    name: "process-invoices-production",
    mainFunction: {
      name: "invoice-extractor-v1",
      versionNum: 1
    },
    relationships: [
      {
        sourceFunction: { name: "invoice-extractor-v1", versionNum: 1 },
        destinationFunction: { name: "vendor-enricher-v1", versionNum: 1 }
      }
    ]
  })
});
```
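Because a Workflow is a DAG, a cycle in `relationships` is a configuration bug worth catching before you send the request. This guide does not say how the API responds to a cyclic definition, so here is a minimal client-side sketch using Kahn's algorithm. `FunctionRef`, `Relationship`, and `topologicalOrder` are hypothetical names that mirror the request body above, and for simplicity the sketch identifies functions by name only, ignoring `versionNum`.

```typescript
interface FunctionRef {
  name: string;
  versionNum: number;
}

interface Relationship {
  sourceFunction: FunctionRef;
  destinationFunction: FunctionRef;
}

// Returns function names ordered so every source precedes its destinations.
// Throws if the relationships contain a cycle (Kahn's algorithm).
function topologicalOrder(relationships: Relationship[]): string[] {
  const outgoing = new Map<string, string[]>();
  const inDegree = new Map<string, number>();

  for (const rel of relationships) {
    const from = rel.sourceFunction.name;
    const to = rel.destinationFunction.name;
    if (!outgoing.has(from)) outgoing.set(from, []);
    if (!outgoing.has(to)) outgoing.set(to, []);
    (outgoing.get(from) as string[]).push(to);
    inDegree.set(from, inDegree.get(from) ?? 0);
    inDegree.set(to, (inDegree.get(to) ?? 0) + 1);
  }

  // Start from functions that nothing points to.
  const names = Array.from(inDegree.keys());
  const ready = names.filter((fn) => inDegree.get(fn) === 0);
  const order: string[] = [];

  while (ready.length > 0) {
    const current = ready.shift() as string;
    order.push(current);
    for (const next of outgoing.get(current) as string[]) {
      const remaining = (inDegree.get(next) as number) - 1;
      inDegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }

  // Any function left unvisited sits on a cycle.
  if (order.length !== names.length) {
    throw new Error("Workflow relationships contain a cycle");
  }
  return order;
}
```

For the two-function workflow above, the resulting order is the extractor followed by the enricher, which matches the execution order the guide describes.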

4. Production: The Event-Driven Loop

We built bem to be asynchronous by default. Blocking APIs (waiting for a response) are fragile. If you send a 50-page invoice to a blocking API, it will eventually time out.

Our architecture allows you to dispatch thousands of invoices simultaneously without degrading performance.

Dispatch (The Call)

You trigger a workflow with a single API call. We accept base64 strings and multipart form data.

```typescript
// POST /v2/calls
// Fire and forget. Non-blocking.

await fetch("https://api.bem.ai/v2/calls", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.BEM_API_KEY ?? ""
  },
  body: JSON.stringify({
    calls: [{
      workflowName: "process-invoices-production",
      // Pass your own internal ID for tracking
      callReferenceID: "invoice_db_id_8823",
      input: {
        singleFile: {
          inputType: "pdf",
          inputContent: "base64_encoded_pdf_string..."
        }
      }
    }]
  })
});
```
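In practice the base64 string comes from file bytes you already hold. Here is a minimal sketch of building the `/v2/calls` body from raw PDF bytes, assuming a Node.js runtime (it relies on the global `Buffer`). `InvoiceFile` and `buildCallsBody` are hypothetical names introduced for this example; the field names inside the body are copied from the request above.

```typescript
// One invoice to dispatch: your internal ID plus the raw PDF bytes.
interface InvoiceFile {
  referenceId: string;
  pdfBytes: Uint8Array;
}

// Builds the JSON body for POST /v2/calls, one call entry per file.
function buildCallsBody(workflowName: string, files: InvoiceFile[]): string {
  return JSON.stringify({
    calls: files.map((file) => ({
      workflowName,
      callReferenceID: file.referenceId,
      input: {
        singleFile: {
          inputType: "pdf",
          // Node's Buffer handles the bytes-to-base64 conversion.
          inputContent: Buffer.from(file.pdfBytes).toString("base64")
        }
      }
    }))
  });
}
```

With `readFile` from `node:fs/promises` supplying the bytes, the returned string can be used directly as the `body` of the dispatch request. Keeping `callReferenceID` tied to your own database ID is what lets you reconcile results later without storing any bem-side identifier.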

React (The Webhook)

You don't poll. You subscribe. When the workflow completes, we push the fully enriched JSON payload to your endpoint. (If you really need to poll, you can! Just use our GET endpoint.)

```typescript
// POST /v1-alpha/subscriptions
// Listen for completed transformations

await fetch("https://api.bem.ai/v1-alpha/subscriptions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.BEM_API_KEY ?? ""
  },
  body: JSON.stringify({
    name: "invoice-complete-listener",
    type: "transform",
    workflowName: "process-invoices-production",
    webhookURL: "https://api.your-company.com/webhooks/bem/invoices"
  })
});
```
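On the receiving side, webhook deliveries can arrive more than once, so the handler should be idempotent. The sketch below shows one way to structure that logic. The payload shape is an assumption: this guide does not document the webhook body, so `InvoiceWebhookEvent` and its `output` field are placeholders, and `handleInvoiceEvent` and `Outcome` are hypothetical names. Only `callReferenceID`, `vendor_name`, `total_amount`, and `matched_vendor_record` echo names used earlier in this guide.

```typescript
// ASSUMED payload shape, for illustration only. Check the real webhook body before relying on it.
interface InvoiceWebhookEvent {
  callReferenceID: string;
  output: {
    vendor_name: string;
    total_amount: number;
    matched_vendor_record?: { id: string; name: string };
  };
}

type Outcome = "posted" | "needs_review" | "duplicate";

// Reference IDs already handled, so a redelivered event is not posted twice.
// In production this would live in a database table, not in process memory.
const processedReferences = new Set<string>();

function handleInvoiceEvent(incoming: InvoiceWebhookEvent): Outcome {
  if (processedReferences.has(incoming.callReferenceID)) return "duplicate";
  processedReferences.add(incoming.callReferenceID);
  // No vendor match means a human should look at the invoice before payment.
  return incoming.output.matched_vendor_record ? "posted" : "needs_review";
}
```

The function is deliberately free of HTTP concerns, so it can sit behind whatever framework or event bus already receives your webhooks.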

Extra Credit: Trust, but Verify

Shipping AI to production without regression testing is negligence.

Because bem treats functions as versioned primitives, you can run evaluations programmatically. Before promoting invoice-extractor-v2, you can use our /v2/functions/regression endpoint to replay historical data against the new version and compare accuracy metrics.

We also offer a specific endpoint, /v2/functions/review, which calculates the statistical confidence of your pipeline and estimates the human review effort needed to hit 99.9% accuracy.
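To make "compare accuracy metrics" concrete, here is a local sketch of one such metric: field-level accuracy of two function versions against the same labeled invoices. It illustrates the kind of comparison a regression replay supports and makes no claim about how the bem endpoints compute their numbers. `FlatRecord` and `fieldAccuracy` are hypothetical names, and the sample values are made up for the example.

```typescript
type FlatRecord = Record<string, string | number>;

// Fraction of (record, field) pairs where the prediction equals the labeled value.
function fieldAccuracy(predicted: FlatRecord[], expected: FlatRecord[], fields: string[]): number {
  if (predicted.length !== expected.length) {
    throw new Error("predicted and expected must have the same length");
  }
  const totalChecks = expected.length * fields.length;
  if (totalChecks === 0) throw new Error("nothing to compare");

  let correct = 0;
  expected.forEach((truth, i) => {
    for (const field of fields) {
      if (predicted[i][field] === truth[field]) correct += 1;
    }
  });
  return correct / totalChecks;
}

// Made-up labeled data and outputs from two versions of the extractor.
const labeled: FlatRecord[] = [
  { vendor_name: "Amazon Web Services", total_amount: 120.5 },
  { vendor_name: "WeWork", total_amount: 3000 }
];
const v1Output: FlatRecord[] = [
  { vendor_name: "Amazon Web Services", total_amount: 120.5 },
  { vendor_name: "WeWork", total_amount: 300 } // misread total
];
const v2Output: FlatRecord[] = [
  { vendor_name: "Amazon Web Services", total_amount: 120.5 },
  { vendor_name: "WeWork", total_amount: 3000 }
];

const checkedFields = ["vendor_name", "total_amount"];
console.log(fieldAccuracy(v1Output, labeled, checkedFields)); // 0.75 (3 of 4 fields correct)
console.log(fieldAccuracy(v2Output, labeled, checkedFields)); // 1
```

A promotion gate can then be as simple as refusing to ship a new version whose accuracy on historical data is lower than the current one.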

Summary: Enterprise-Grade Reliability

Invoices are messy, but your architecture shouldn't be. By decoupling the extraction (Transform) from the grounding (Enrich), and orchestrating them in an event-driven workflow, you build a system that is:

  1. Embeddable: It fits into your existing event bus.
  2. Customizable: You define the schema, not us.
  3. Accurate: Data is validated against your own database via Collections.

Stop writing prompts. Start building pipelines.

Ready to see it in action?

Talk to our team to walk through how bem can work inside your stack.
