TL;DR

Hashing JSON is easy until two correct systems produce different bytes for the same event. Key order, whitespace, number formatting, timestamps, Unicode, optional fields, and protocol unit conversions can all change the byte stream before a signer ever sees it. JouleBridge treats canonicalization as the first serious engineering step in the proof path: adapt the event, normalize the representation, produce deterministic bytes, sign those bytes, and make the verifier repeat the same work without trusting the server.

The problem in one hundred words

A signature over a random byte string proves that someone signed that byte string. It does not prove that anyone else can reproduce it.

That is the trap in signed energy events. A charger, a meter, and a verifier can all agree that the event means "meter M1 imported 19.42 kWh at 09:12 UTC" and still compute three different hashes. JSON makes this feel absurd because the objects look equivalent on screen. Cryptography does not sign what looks equivalent on screen. It signs bytes.

Canonicalization is the discipline that turns "same meaning" into "same bytes." Without it, signed evidence becomes expensive theater.

The betrayal starts with key order

The pair below is the beginner bug. It is also the bug that appears in production when one adapter uses insertion order and another serializes a map after sorting.

Figure: Canonicalization lab (interactive). Switch the source ambiguity and the verifier only gets a stable hash after the runtime collapses the representation into deterministic bytes. Real SHA-256 digests are computed over both forms: non-canonical input changes the raw digest, while the canonical bytes stay stable and preserve string data as parsed.

Raw input:

{
  "transactionId": 8042,
  "unit": "kW",
  "connectorId": 1,
  "messageType": "MeterValues",
  "chargePointId": "ocpp-gw-17",
  "measurand": "Power.Active.Import",
  "value": 42.7,
  "timestamp": "2026-05-16T09:12:00Z"
}

Canonical bytes:

{"chargePointId":"ocpp-gw-17","connectorId":1,"measurand":"Power.Active.Import","messageType":"MeterValues","timestamp":"2026-05-16T09:12:00.000Z","transactionId":8042,"unit":"kW","value":42.7}
{
  "meter": "M1",
  "kwh": 19.42,
  "site": "Pune-depot-07"
}
{
  "site": "Pune-depot-07",
  "meter": "M1",
  "kwh": 19.42
}
Same event meaning, different object property order. A naive byte hash changes.

The runtime does not get to shrug here. If the JouleBridge gateway signs the first representation and a discom verifier rebuilds the second, the signature check fails even though nobody tampered with the reading.

This is why RFC 8785, the JSON Canonicalization Scheme, is useful. It constrains JSON to I-JSON, defines deterministic serialization, and sorts object properties. The standard describes JCS output as a hashable representation of JSON data. That phrase is not academic decoration. It is the whole point.
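In Python, both the gap and the fix fit in a few lines. A minimal sketch, with one caveat: json.dumps with sorted keys and tight separators only approximates JCS for simple payloads like this one, since full RFC 8785 also pins ECMAScript number serialization, which the standard library does not implement.

import hashlib
import json

left = {"meter": "M1", "kwh": 19.42, "site": "Pune-depot-07"}
right = {"site": "Pune-depot-07", "meter": "M1", "kwh": 19.42}

def naive_bytes(obj):
    # Insertion-order serialization: same meaning, different bytes.
    return json.dumps(obj).encode("utf-8")

def canonical_bytes(obj):
    # JCS-style: sorted properties, no insignificant whitespace.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

assert naive_bytes(left) != naive_bytes(right)          # the beginner bug
assert canonical_bytes(left) == canonical_bytes(right)  # the fix
print(hashlib.sha256(canonical_bytes(left)).hexdigest())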

The field version is messier. Protocol adapters rarely hand over perfect JSON. Modbus gives registers. OCPP gives charger sessions with its own vocabulary. DLMS/COSEM has meter objects and scaler units. HTTP webhooks arrive with whatever the vendor thought was a good idea on a Friday. The canonicalizer is where those shapes become a runtime event.

Whitespace, decimals, and the lies we tell ourselves

Humans look past whitespace. Hash functions do not.

{
  "command": "start",
  "max_kw": 40,
  "policy": "tariff-v7"
}
{
  "command": "start",
  "max_kw": 40,
  "policy": "tariff-v7"
}
Application code may treat 40 and 40.0 as equivalent. A proof layer needs one numeric representation before signing.

The number problem is worse than it first looks. A site limit of 40 kW may arrive as 40, 40.0, "40", 40000 W, or 40.000000 after a spreadsheet export. Some representations are wrong for the schema. Some are acceptable inputs that need to collapse into one canonical value. Some should be rejected because precision was lost before the runtime saw the event.

JouleBridge has to be conservative. It should not silently accept every format because energy software already has enough polite lies. If a field is a decimal quantity with units, the adapter should say so. If a field is a string because the source cannot preserve precision, the event schema should say so. If a value cannot be normalized without changing meaning, the runtime should quarantine it and produce an error receipt.

That receipt matters. A rejected event is not a missing event. It is evidence that the runtime refused to make a false claim.
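A hedged sketch of that accept, normalize, or reject split; the receipt shape and the unit table here are illustrative assumptions, not JouleBridge's actual schema:

from decimal import Decimal, InvalidOperation

ACCEPTED_UNITS = {"kW": Decimal(1000), "W": Decimal(1)}  # scale to watts

def normalize_power(value, unit):
    """Collapse accepted inputs into one canonical value in watts,
    or return a rejection receipt instead of guessing."""
    try:
        qty = Decimal(str(value))
    except InvalidOperation:
        return {"status": "rejected", "reason": "unparseable_number",
                "raw": repr(value)}
    if unit not in ACCEPTED_UNITS:
        return {"status": "rejected", "reason": "unknown_unit", "raw": unit}
    watts = qty * ACCEPTED_UNITS[unit]
    if watts != watts.to_integral_value():
        # Sub-watt precision from a kW source is suspicious: quarantine it.
        return {"status": "rejected", "reason": "precision_loss",
                "raw": f"{value} {unit}"}
    return {"status": "ok", "power_w": int(watts)}

print(normalize_power(40, "kW"))       # {'status': 'ok', 'power_w': 40000}
print(normalize_power("40.0", "kW"))   # same canonical value
print(normalize_power(40.0001, "kW"))  # rejected: precision_loss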

Timestamps and Unicode hide the bug differently

Timestamps and Unicode are the two places where the bug becomes insulting. The text looks the same. The bytes are not.

{
  "device": "meter-01",
  "location": "Bengaluru",
  "ts": "2026-05-16T09:12:00Z"
}
{
  "device": "meter-01",
  "location": "Bengaluru",
  "ts": "2026-05-16 09:12:00+00:00"
}
The timestamp names the same instant, but the byte representation differs. JouleBridge signs one normalized UTC form.

Timestamps deserve their own little circle of blame. Some devices report local time. Some report epoch seconds. Some report epoch milliseconds. Some include timezone offsets. Some omit them and ask the operator to enjoy the mystery. Tariff windows then depend on exactly those timestamps.

This is not a formatting issue when money is attached. If a kWh crosses a tariff boundary, the timestamp is part of the bill. A 10-minute drift is not a harmless observability gap. It is a settlement problem.
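A sketch of collapsing those shapes into one normalized UTC form, assuming the adapter declares which shape the source emits rather than guessing; the target matches the millisecond RFC 3339 form in the canonical bytes above:

from datetime import datetime, timezone

def normalize_ts(raw, source_format: str) -> str:
    """Collapse declared timestamp shapes into one RFC 3339 UTC form
    with millisecond precision. Undeclared shapes are rejected."""
    if source_format == "epoch_s":
        dt = datetime.fromtimestamp(float(raw), tz=timezone.utc)
    elif source_format == "epoch_ms":
        dt = datetime.fromtimestamp(float(raw) / 1000.0, tz=timezone.utc)
    elif source_format == "rfc3339":
        dt = datetime.fromisoformat(str(raw).replace("Z", "+00:00"))
        if dt.tzinfo is None:
            raise ValueError("no offset: refuse to guess the timezone")
        dt = dt.astimezone(timezone.utc)
    else:
        raise ValueError(f"undeclared timestamp shape: {source_format}")
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"

assert normalize_ts("2026-05-16T09:12:00Z", "rfc3339") \
    == normalize_ts("2026-05-16 09:12:00+00:00", "rfc3339") \
    == normalize_ts(1778922720, "epoch_s") \
    == "2026-05-16T09:12:00.000Z"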

Unicode adds another edge. RFC 8785 explicitly avoids Unicode normalization after parsing. That is the right choice for a cryptographic canonicalizer, but it means upstream systems must preserve string data as-is and schema authors must avoid clever identifiers where byte-level equality becomes a cultural studies seminar.
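A short illustration of the trap: two strings that render identically can hash differently, which is exactly the problem JCS refuses to paper over after parsing. The site label here is a hypothetical example:

import hashlib
import unicodedata

# Hypothetical site label containing "é", in two visually identical forms.
nfc = "d\u00e9p\u00f4t-07"                # precomposed code points
nfd = unicodedata.normalize("NFD", nfc)   # base letters plus combining marks

assert nfc != nfd                         # equal on screen, unequal in bytes
print(hashlib.sha256(nfc.encode("utf-8")).hexdigest()[:16])
print(hashlib.sha256(nfd.encode("utf-8")).hexdigest()[:16])
# JCS hashes whichever form was parsed; it will not normalize either one.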

Why this matters for signed systems

Programs usually work with data in at least two different representations.

Martin Kleppmann, Designing Data-Intensive Applications, Chapter 4, 2017

Kleppmann names two representations: the application's working model and the bytes that travel between systems. Every crossing is a place where meaning leaks. In a signed system the leak stops being theoretical; the verifier either recomputes the same hash or it does not.

The common lazy pattern is "just stringify it and sign." That works until another runtime, another language, another JSON library, or another numeric edge case enters the room. Then the signature path becomes a support ticket.

JouleBridge has to do the more boring thing. Define event schemas. Normalize units. Normalize timestamps. Reject ambiguous values. Serialize deterministically. Then sign.

Figure: where naive JSON signing tends to fail. Qualitative failure pressure from JouleBridge proof-path design notes and RFC 8785 edge cases; higher means more likely to break verifier agreement.

What RFC 8785 gives us

RFC 8785 does five jobs that matter here.

It removes insignificant whitespace. It serializes primitives using defined ECMAScript behavior. It constrains inputs to I-JSON, which avoids duplicate property names and unsafe number assumptions. It sorts object properties deterministically. It preserves parsed string data without a normalization pass that would mutate the original value.

That is enough to make JSON hashable. It is not enough to make an energy event correct.

This distinction matters because engineers sometimes treat a canonicalization standard as if it were a domain model. It is not. JCS can produce stable bytes for a JSON object. It cannot decide whether energy_imported_wh and kwh are equivalent, whether a meter scaler was applied, whether a timestamp came from the device clock or gateway clock, or whether an OCPP session counter wrapped after a reboot.

The domain canonicalizer sits above the byte canonicalizer. JouleBridge first turns protocol input into a typed event. Then it turns that typed event into canonical JSON bytes. Then it hashes and signs those bytes.

That layering keeps the system honest. If the typed event is wrong, JCS will faithfully produce stable bytes for the wrong thing. Cryptography is not quality control. It is a binding mechanism.

A field example from an energy site

Here is the kind of event that looks harmless in a dashboard and becomes dangerous in a proof system:

meter M1 reports 19.42 kWh imported during session S441 at 09:12 UTC

The adapter may receive that event as Modbus registers, an OCPP session field, a meter-head-end export, or a webhook from a vendor platform. The first question is not "can we show it on a chart?" The first question is "what exactly are we willing to sign?"

For JouleBridge, the signed event should not preserve the source mess. It should preserve the source meaning after declared adaptation. The event needs a site ID, device ID, source protocol, adapter version, measurement kind, normalized unit, numeric value, source timestamp, runtime timestamp, and a trace back to the raw observation. If a scaler was applied, the event should say which scaler. If the timestamp came from the gateway because the source device had no usable clock, the event should say that too.
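Sketched as a record, with illustrative field names standing in for the real versioned schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class MeterEvent:
    # Illustrative shape; field names are assumptions, not the real schema.
    site_id: str          # "Pune-depot-07"
    device_id: str        # "M1"
    source_protocol: str  # "modbus" | "ocpp" | "dlms" | "webhook"
    adapter_version: str  # which code interpreted the raw observation
    measurement: str      # "energy_imported"
    unit: str             # normalized, e.g. "wh"
    value: int            # 19420, after the declared scaler
    scaler_applied: str   # e.g. "dlms_scaler:-1" or "none"
    source_ts: str        # device-reported, normalized RFC 3339 UTC
    runtime_ts: str       # gateway clock at ingestion
    ts_source: str        # "device_clock" | "gateway_clock"
    raw_ref: str          # sha256 of the raw observation bytes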

This is where a lot of systems cheat. They store the prettiest value and throw away the ugly path that produced it. The product manager sees 19.42 kWh. The billing team sees 19.42 kWh. The engineer debugging a dispute sees a number with no ancestry. That is not evidence. That is a screenshot with confidence.

The canonical event has to be meaner than that. It should make the adapter confess. The source said this. The runtime interpreted it this way. The unit changed here. The timestamp was trusted to this level. The event hash covers this representation. The signature binds this signer. The chain puts it after this previous event. The policy gate used this bundle.

Once that record exists, the UI can become pleasant. Without that record, the UI is trying to launder uncertainty into a neat line chart.

The practical design rule is simple: never sign a value whose journey you cannot explain. If the runtime cannot explain the journey, it should produce a rejection receipt and keep moving. Operators can tolerate a flagged event. They cannot tolerate a proof layer that quietly invents certainty.

The Bridge Kernel canonicalizer

The shape I want in Bridge Kernel is small enough to explain and strict enough to be annoying:

canonicalize(event: RuntimeEvent, schema: EventSchema, policy: CanonPolicy) -> CanonicalBytes

RuntimeEvent is the typed event after the adapter has parsed the source protocol. EventSchema declares required fields, allowed units, numeric representation, timestamp rules, and protocol-specific mappings. CanonPolicy declares whether the runtime may coerce safe inputs, which ambiguous inputs must be rejected, and how to attach rejection receipts.

The output is not a pretty object. It is signing material. It needs a schema version, canonical event bytes, event hash, previous chain hash, timestamp attestation, and enough metadata for the verifier to repeat the same process later.
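A minimal Python sketch of that contract, under stated assumptions: plain dicts stand in for RuntimeEvent, EventSchema, and CanonPolicy, the serialization approximates JCS rather than implementing all of RFC 8785, and the timestamp attestation is omitted for brevity.

import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class SigningMaterial:
    schema_id: str         # names the canonicalization rules the verifier loads
    canonical_body: bytes  # the exact bytes the signature covers
    event_hash: str        # sha256 hex over canonical_body
    prev_hash: str         # hash of the previous event in the chain

def canonicalize(event: dict, schema: dict, policy: dict,
                 prev_hash: str) -> SigningMaterial:
    # Illustrative: a real EventSchema also carries unit, timestamp, and
    # mapping rules; a real CanonPolicy decides coerce versus reject.
    missing = [f for f in schema["required"] if f not in event]
    if missing:
        raise ValueError(f"reject and emit receipt: missing {missing}")
    body = json.dumps(event, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")
    return SigningMaterial(schema["id"], body,
                           hashlib.sha256(body).hexdigest(), prev_hash)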

The verifier should not import the full JouleBridge server. It should import the canonicalization rules, load the evidence pack, rebuild the bytes, recompute the hash, and verify the signature. If it cannot do that offline, the evidence pack is asking for trust instead of producing proof.
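That loop can be small. A sketch of the offline check, assuming an Ed25519 signature over the canonical bytes and a hypothetical evidence-pack layout; the pyca/cryptography calls are real, the pack field names are not:

import hashlib
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_event(pack: dict, public_key_bytes: bytes) -> bool:
    """Rebuild the canonical bytes, recompute the hash, check the
    signature. No server import, no network, no trust."""
    body = json.dumps(pack["event"], sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")
    if hashlib.sha256(body).hexdigest() != pack["event_hash"]:
        return False  # bytes do not reproduce: the pack is asking for trust
    try:
        Ed25519PublicKey.from_public_bytes(public_key_bytes).verify(
            bytes.fromhex(pack["signature"]), body)
        return True
    except InvalidSignature:
        return False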

This is also why the canonicalizer should be boring code. Fancy abstraction here is usually a smell. The job is to accept, adapt, or reject with extreme clarity. If an engineer needs an architectural diagram to understand why 19.42 kWh became 19420 Wh, the code is already too clever.

a valid transaction can be modified in-flight, without invalidating it

Pieter Wuille, BIP 62: Dealing with malleability, 2014

Wuille was describing Bitcoin transaction malleability, but the warning lands cleanly in any signed evidence system. If the bytes can change while the claim remains valid to application logic, the proof layer has failed. JouleBridge has to remove that space: stable bytes, explicit schema versions, and refusal behavior for fields the verifier cannot safely interpret.

Edge cases that broke the clean story

The clean story is "normalize, hash, sign." The real story has a longer bug list.

Integer overflow is the obvious one. Energy counters can grow. Device registers can wrap. A signed event must distinguish between a reset, a wrap, and a real drop. If a meter goes from 4,294,967,290 Wh to 24 Wh, the runtime should not sign a negative monthly bill and call it innovation.
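A sketch of that three-way distinction for a 32-bit register, assuming the adapter declares the register width and a plausible maximum step between reads:

UINT32 = 2 ** 32

def classify_counter_step(prev_wh: int, curr_wh: int, max_step_wh: int):
    """Distinguish a register wrap from a device reset from a real drop.
    max_step_wh is the largest credible advance between two reads."""
    if curr_wh >= prev_wh:
        return ("advance", curr_wh - prev_wh)
    wrapped = (curr_wh + UINT32) - prev_wh
    if 0 <= wrapped <= max_step_wh:
        return ("wrap", wrapped)   # the counter rolled over the register
    if curr_wh <= max_step_wh:
        return ("reset", None)     # device restarted near zero
    return ("drop", None)          # quarantine; never sign a negative bill

print(classify_counter_step(4_294_967_290, 24, max_step_wh=100_000))
# ('wrap', 30): a 30 Wh advance, not a negative monthly bill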

Locale-dependent parsing is the embarrassing one. A comma can be a thousands separator or a decimal separator depending on the source. If the adapter accepts both without a source rule, it is guessing with financial consequences.

Timezone handling is the recurring one. The event timestamp, gateway timestamp, ingestion timestamp, and policy-window timestamp are not interchangeable. I want the proof envelope to name the timestamp source because a verifier should know whether it is checking device time or runtime time.

Payload size limits are the quiet one. If a vendor event includes a giant diagnostic blob, the canonicalizer should not let the proof path become a memory denial-of-service test. Large payloads need hashing, references, and explicit inclusion rules.
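One hedged way to enforce that, sketched below: inline small payloads, reference large ones by digest, and never let unbounded input reach the canonicalizer. The cap and field names are illustrative:

import base64
import hashlib

MAX_INLINE_BYTES = 4096  # illustrative cap, not a real JouleBridge constant

def attach_diagnostic(event: dict, blob: bytes) -> dict:
    """Keep the canonical event bounded: inline small payloads,
    reference large ones by digest so they stay verifiable."""
    if len(blob) <= MAX_INLINE_BYTES:
        event["diagnostic"] = base64.b64encode(blob).decode("ascii")
    else:
        event["diagnostic_ref"] = {
            "sha256": hashlib.sha256(blob).hexdigest(),
            "bytes": len(blob),
        }
    return event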

Optional fields are the political one. Product teams love optional fields because they keep demos moving. Verifiers hate optional fields because absence can mean "not applicable," "not collected," "lost," "redacted," or "nobody thought about it." The schema should force that distinction.
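A sketch of forcing that distinction at the schema boundary, with an explicit absence marker instead of a bare null; the names are illustrative:

from enum import Enum

class Absence(str, Enum):
    NOT_APPLICABLE = "not_applicable"
    NOT_COLLECTED = "not_collected"
    LOST = "lost"
    REDACTED = "redacted"

def set_optional(event: dict, field: str, value=None, absent=None):
    """An optional field carries either a value or a named reason for
    absence; 'nobody thought about it' stops being representable."""
    if (value is None) == (absent is None):
        raise ValueError("exactly one of value or absent must be given")
    event[field] = value if value is not None else {"absent": absent.value}
    return event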

What hashes do not prove

A hash proves equality of bytes. It does not prove truth.

If the device lied, a hash preserves the lie. If the adapter applied the wrong scaler, a hash preserves the wrong value. If the gateway clock drifted, a hash preserves the bad timestamp. If the policy bundle allowed a bad command, a hash preserves the bad decision. This is why signed evidence systems need multiple layers.

Canonicalization gives repeatable bytes. A signature binds those bytes to a key. A chain binds the event to an order. A policy gate binds the event or command to the rule that allowed or rejected it. An evidence pack binds a time window to a reviewable export. The verifier checks the whole chain of claims.

Figure: what each proof layer can actually claim. Qualitative claim strength by layer; canonical bytes are necessary but weak alone, while signatures, chains, and policy receipts create the audit value.

The funny thing is that the byte layer is still where many proof systems fail. People want to talk about zero knowledge, blockchains, TEEs, and AI auditors before they can make two services hash the same object. That is how you get a white paper with a consensus protocol and a production system that cannot survive 1 versus 1.0.

Compared to COSE, JWS, and Bitcoin

COSE, specified in RFC 9052, is a serious answer when the payload and signature package should be compact and binary. It builds on CBOR, which was designed for small code size and small message size. That matters for embedded and edge systems. JouleBridge may eventually use COSE-style packaging for proof envelopes, especially when packs need cleaner algorithm metadata and compact transport.

JWS, specified in RFC 7515, is the more familiar JSON-era signature container. It is widely implemented and useful, but it does not remove the need to decide what payload bytes are being signed. A JWS over unstable JSON is still unstable at the layer that matters.

Bitcoin took a different route: design a transaction serialization format and make consensus depend on exact bytes. The lesson is not "use Bitcoin." Please don't turn an EV depot into a token experiment because a slide needed texture. The lesson is that serious distributed systems define their bytes before they define their claims.

JouleBridge does not need public consensus for every meter read. It needs local verifiability, exportable evidence, and deterministic replay. That is a smaller and more useful problem.

What I would do differently

If I were starting the proof path again, I would write the verifier before the dashboard.

The dashboard makes the system feel alive. The verifier makes the system honest. When the verifier exists first, every feature has to answer a harsher question: what bytes will the external reviewer recompute, and what evidence will they need?

I would also make schema versioning boring from day one. Every event should carry a schema ID. Every evidence pack should name the canonicalization version. Every policy bundle should be signed and referenced by hash. Upgrades should produce transition records. Nothing should depend on oral history.

The last lesson is cultural. Canonicalization looks like plumbing, so teams postpone it. They want to ship the feature, then add proof later. That order is backwards for infrastructure. The proof layer shapes the feature. If you add it later, you discover that half the data you need was never captured and the other half was captured in whatever form made the UI easy.

Energy sites are about to get more software, more automation, more tariffs, more AI proposals, and more disputes. The canonical bytes are not the glamorous part. They are the part that lets every other claim stand up.

Sources