Storing spatial memory in a PNG: how we made visual UI matching 6.5× faster in OculiX


19 May 2026

This is a long technical post about a small change. It explains a measurement we never thought to make, the discovery it produced, the implementation we built around it, and what it means for visual automation suites running in CI.

The headline number is real but not particularly dramatic: visual UI matching in OculiX became roughly 6.5 times faster, measured on 50 cold-start JVM runs on an Intel i3 7th generation laptop. The interesting part is not the speedup itself. It is the path that led to it, and the structural lesson that came with it.

If you maintain or use a visual automation framework, a test suite that watches pixels, or any tool that needs to find an image inside another image repeatedly across separate process invocations, the pattern we describe here will probably apply to you. The implementation took a utility class of about 200 lines of Java, three micro-modifications inside the existing codebase, and no external dependencies, for roughly 250 lines in total. The standards we relied on have been published since 1996.

What follows is the full story, ordered the way the investigation actually unfolded, not the way a marketing post would tell it.

The benchmark we never made

OculiX is the active continuation of the Sikuli and SikuliX visual automation lineage, MIT-licensed and used in production by close to 100 organizations across banking, defense, healthcare, manufacturing, and retail. Its core operation is a function called find: given a small image (a button, an icon, a region of UI), find it inside the current screen capture and return its coordinates. Every other operation in the public API contains at least one call to find underneath.

Inside the codebase, find is well understood. It uses OpenCV template matching via JNI bindings, with five fallback strategies cascaded inside Finder.java:

Mode 1: Standard match

Exact template matching with the configured similarity threshold. The fast path for stable UI elements.

Mode 2: DPI-aware rescale

If the screen DPI differs from the pattern capture DPI, the template is rescaled before matching.

Mode 3: Tolerant blur

GaussianBlur applied to both source and target. Tolerates antialiasing and subtle color variations.

Mode 4: Grayscale smart

Conversion to grayscale before matching. Tolerates color theme changes.

Mode 5: Multi-scale brute force

Last resort. Tries multiple scales (0.5x to 2x) to catch significantly resized elements.

The code is sixteen years old at this point, refined incrementally by the original Sikuli authors at MIT in 2009, by Raimund Hocke from 2010 to 2025 under SikuliX, and now by the OculiX maintainers.

Performance, in this kind of codebase, is rarely benchmarked from scratch. Everyone assumes it is whatever it has always been. A find call takes “some milliseconds”, or “a few hundred milliseconds” on a slow machine, and life goes on.

So I sat down on a Sunday afternoon to actually measure it. Not because there was a problem. Because the question of how fast it really was had never been answered with a number on the current hardware I had in front of me.

The harness

The setup was deliberately simple. A standalone Java class called FindTiming, compiled directly against the OculiX complete-win jar, performing exactly one find per JVM invocation, then exiting. A batch script wrapping that class in a loop of fifty separate executions.

So fifty cold starts. Each one paid the JVM startup cost, the OpenCV library load, the Tesseract OCR engine load, the OculiX framework initialization, then performed exactly one find and reported the elapsed time before exiting.

The baseline

The result, on this five-year-old i3 laptop, was extremely consistent:

Metric	Value
Mean	502 ms
Median	502 ms
Minimum	480 ms
Maximum	545 ms
Range	65 ms
Standard deviation	18 ms

Half a second to find a small image on a 1920 by 1080 screen. On a modern Intel i7 or Apple Silicon machine the number would be lower, probably by a factor of two or three. On a GitHub Actions standard runner it would be roughly comparable to the i3. On an older corporate desktop running a Citrix client through a VPN it would be slower again.

Take 500 milliseconds and project it across a test suite:

Suite scale	Find calls	Pure find time @ 500ms
Small functional suite	200	100 seconds
Medium regression suite	1 000	8 min 20 s
Large nightly suite	5 000	41 min 40 s
Enterprise full coverage	20 000	2 h 46 min

Multiply by the number of suites running per day in a CI environment, multiply by the number of CI minutes billed by the runner provider, and the cost becomes very real.

That was the baseline. No optimization had been attempted yet. The number was simply the truth about what happened on the metal.

What was already in the code

Before deciding what to optimize, the right move is always to look at what the code already does. OculiX inherits sixteen years of optimization attempts from the Sikuli and SikuliX lineage. A naive optimizer would re-read the OpenCV documentation and propose to switch to a faster matching algorithm. That would be a beginner mistake.

The thing to investigate first is whether there is some optimization the existing code already attempts but cannot fully complete in the current configuration. That is almost always where the gains hide.

A few minutes of grep revealed a field, a setting, and a method that nobody seems to talk about:

Image.lastSeen

Private Rectangle field on every Image object. Stores the position of the most recent successful match. Paired with getLastSeen() and setLastSeen(rect, score) accessors.

Settings.CheckLastSeen

Public static boolean, set to true by default since at least 2018. Enables the optimization at the framework level.

checkLastSeenAndCreateFinder

Private method in Region.java. When called, creates a Finder restricted to the small rectangle around the previous match, falling back to full-screen only if needed.

The intent of these three pieces is clear once you piece them together. After a successful find, the rectangle of the match is stored in the Image object’s lastSeen field. On the next call to find for the same image, if Settings.CheckLastSeen is true and lastSeen is non-null, the code creates a Finder restricted to that small rectangle and tries to match there first. Only if the small-region match fails does it fall back to a full-screen scan.

This is a classic spatial memoization pattern, well-known in computer vision. Sikuli implemented it correctly, a long time ago.
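To make the control flow concrete, here is a minimal sketch of a last-seen-first lookup on a toy integer grid. It is not the OculiX code: `LastSeenSketch`, `scan`, and `find` are hypothetical names, and exact cell comparison stands in for OpenCV template matching. Only the order of operations mirrors the description above.

```java
import java.awt.Rectangle;

/** Toy model of a last-seen-first lookup. Every name here is illustrative. */
final class LastSeenSketch {

    /** True when the template sits exactly at (x, y) on the screen grid. */
    static boolean matchesAt(int[][] screen, int[][] tpl, int x, int y) {
        for (int r = 0; r < tpl.length; r++) {
            for (int c = 0; c < tpl[0].length; c++) {
                if (screen[y + r][x + c] != tpl[r][c]) return false;
            }
        }
        return true;
    }

    /** Scans one area for the template and returns the match rectangle, or null. */
    static Rectangle scan(int[][] screen, int[][] tpl, Rectangle area) {
        int th = tpl.length, tw = tpl[0].length;
        for (int y = area.y; y + th <= area.y + area.height; y++) {
            for (int x = area.x; x + tw <= area.x + area.width; x++) {
                if (matchesAt(screen, tpl, x, y)) return new Rectangle(x, y, tw, th);
            }
        }
        return null;
    }

    /** Tries the remembered rectangle first, then falls back to the whole screen. */
    static Rectangle find(int[][] screen, int[][] tpl, Rectangle lastSeen) {
        Rectangle full = new Rectangle(0, 0, screen[0].length, screen.length);
        if (lastSeen != null) {
            Rectangle hit = scan(screen, tpl, lastSeen.intersection(full));
            if (hit != null) return hit;      // fast path: the element has not moved
        }
        return scan(screen, tpl, full);       // slow path: cold start or stale position
    }
}
```

With `lastSeen` set, the first scan touches only a handful of cells. With `lastSeen` equal to null, every call pays for the full grid, which is the situation a fresh process is always in.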

The pattern works, until it doesn’t

The optimization works beautifully inside a single JVM session. If you run a script that performs screen.find("submit_button.png") ten times in a row, the first call pays the full-screen scan cost, but the next nine calls find the image almost instantly through checkLastSeen. The cache hit rate is essentially 100 percent on stable UIs.

There is, however, a subtle but critical limitation: lastSeen lives only in memory, on an Image object that disappears when the JVM exits.

This is exactly what our benchmark exposed. Fifty separate JVM invocations, fifty Image objects with lastSeen always equal to null at the moment of the find call, fifty full-screen scans. The checkLastSeen optimization was active and present in the code throughout, but it had no input to work with. The cache was empty because the cache lived inside a process that died right after building it.

This is the core observation. The existing optimization was correct in design. It was simply unable to bridge the gap between JVM invocations. In a typical test environment, where each test is its own process, the optimization never had a chance to engage.

The code that solved the problem was already there, written long before this benchmark was performed, in a careful and well-tested form. The missing piece was not a clever algorithm. It was a way to keep the optimization’s input alive across process boundaries. A storage problem, not a computation problem.

The missing piece: persistence

Once the gap was identified, the question became: how do you persist Image.lastSeen between JVM invocations? Several candidate approaches surfaced, each with its own trade-offs.

Approach	Description	Drawbacks
Sidecar file	Write `foo.png.position` next to each PNG	Two files to commit, risk of desynchronization, clutter
Central project file	Single `.oculix-positions.toml` at project root	Linear lookup cost, merge conflicts in parallel CI, full file rewrite per change
PNG ancillary chunk	Embed the position metadata inside the PNG itself	Requires writing PNG-aware code, but standard since 1996

The PNG ancillary chunk option won, for reasons that became clearer as we explored the constraint of CI environments.

PNG ancillary chunks: the underused W3C standard

The PNG file format, standardized by the W3C in 1996, is a structured container made of a fixed signature followed by a sequence of chunks. Each chunk has a four-byte length, a four-byte type, a variable-length data payload, and a four-byte CRC32 of the type and data.

The PNG standard distinguishes between critical chunks (IHDR, IDAT, IEND, and others), which are mandatory for decoding the image, and ancillary chunks, which are optional and can be safely ignored by decoders that do not recognize them.
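The chunk layout is simple enough to walk by hand. The sketch below is an illustration, not part of OculiX: `PngChunkWalk`, `frame`, and `chunkTypes` are hypothetical names. It frames a chunk as length, type, data, and CRC32, then walks a byte array chunk by chunk, checking each CRC over type plus data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class PngChunkWalk {
    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'};

    /** Frames one chunk: length, type, data, CRC32(type + data), all big-endian. */
    static byte[] frame(String type, byte[] data) {
        byte[] t = type.getBytes(StandardCharsets.US_ASCII);
        CRC32 crc = new CRC32();
        crc.update(t);
        crc.update(data);
        return ByteBuffer.allocate(12 + data.length)
                .putInt(data.length).put(t).put(data).putInt((int) crc.getValue())
                .array();
    }

    /** Lists chunk types in file order; throws if a stored CRC does not match. */
    static List<String> chunkTypes(byte[] png) {
        ByteBuffer buf = ByteBuffer.wrap(png);   // big-endian by default
        buf.position(SIGNATURE.length);          // signature is skipped, not validated
        List<String> types = new ArrayList<>();
        while (buf.remaining() >= 12) {
            int length = buf.getInt();
            byte[] type = new byte[4];
            buf.get(type);
            byte[] data = new byte[length];
            buf.get(data);
            int storedCrc = buf.getInt();
            CRC32 crc = new CRC32();             // the length field is not covered
            crc.update(type);
            crc.update(data);
            String name = new String(type, StandardCharsets.US_ASCII);
            if ((int) crc.getValue() != storedCrc) {
                throw new IllegalStateException("CRC mismatch in chunk " + name);
            }
            types.add(name);
            if (name.equals("IEND")) break;
        }
        return types;
    }
}
```

A decoder that does not know a chunk type can skip it using the length field alone, which is what makes ancillary chunks safe to add.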

We chose the type code oPLx for our chunk:

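Lowercase o: ancillary, not critical
Uppercase P: the PNG specification reads an uppercase second letter as public and a lowercase one as private, so a strictly conforming private name would lowercase this letter; oPLx is nonetheless an unregistered, OculiX-specific type
Uppercase L: reserved bit, must be uppercase
Lowercase x: safe to copy

Decoded: an optional, safe-to-copy chunk identified by oPLx and used only by OculiX. A PNG editor that follows the specification and does not recognize the chunk ignores it on load and, because of the safe-to-copy bit, may carry it through unchanged on save. Tools that deliberately strip metadata, such as aggressive optimizers, can still remove it, in which case OculiX falls back to a full-screen scan.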
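The four property bits can be checked mechanically. In the PNG specification, bit 5 (value 0x20) of each type byte carries the property, which is the same bit that distinguishes lowercase from uppercase ASCII: a lowercase first letter means ancillary, a lowercase second letter means private, the third letter must be uppercase, and a lowercase fourth letter means safe to copy. The helper below is a small sketch with hypothetical names (`ChunkNameBits` and its methods), not OculiX code.

```java
import java.nio.charset.StandardCharsets;

final class ChunkNameBits {
    /** Bit 5 (0x20) of a type byte is its property bit; set means lowercase. */
    static boolean isLower(String type, int index) {
        return (type.getBytes(StandardCharsets.US_ASCII)[index] & 0x20) != 0;
    }

    static boolean isAncillary(String type)  { return isLower(type, 0); }
    static boolean isPrivate(String type)    { return isLower(type, 1); }
    static boolean isReservedOk(String type) { return !isLower(type, 2); }
    static boolean isSafeToCopy(String type) { return isLower(type, 3); }

    public static void main(String[] args) {
        for (String t : new String[] {"IHDR", "tEXt", "oPLx"}) {
            System.out.printf("%s ancillary=%b private=%b reservedOk=%b safeToCopy=%b%n",
                    t, isAncillary(t), isPrivate(t), isReservedOk(t), isSafeToCopy(t));
        }
    }
}
```

Running it on oPLx reports ancillary and safe to copy, with the private bit clear because the second letter is uppercase.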

The format of the oPLx chunk

The internal layout of the oPLx chunk’s data payload is deliberately small. The goal was a fixed-size, fast-to-parse, easy-to-debug binary structure. The total payload is exactly 34 bytes:

Offset	Size	Type	Field	Notes
0	4	ASCII	Magic `OPL\0`	Redundant identifier, prevents misinterpretation
4	2	uint16 (BE)	Version	Currently `1`. Bump on breaking format change.
6	4	int32 (BE)	X	Pixel coordinate, last successful match
10	4	int32 (BE)	Y	Pixel coordinate, last successful match
14	4	int32 (BE)	Width	Match rectangle width, pixels
18	4	int32 (BE)	Height	Match rectangle height, pixels
22	8	int64 (BE)	Timestamp	UNIX epoch milliseconds, last update
30	4	int32 (BE)	Run count	Total successful matches since file creation

Total: 34 bytes of payload, plus the standard 12 bytes of PNG chunk framing (4 bytes length, 4 bytes type, 4 bytes CRC32). Each pattern gains 46 bytes of metadata embedded in its PNG file. For a project with 500 patterns, this represents 23 kilobytes of additional space across the entire pattern library. Effectively negligible.
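The layout in the table maps directly onto a ByteBuffer, which is big-endian by default. The following is a minimal sketch of an encoder and decoder for the 34-byte payload. The class name `OplxPayload` and its fields are assumptions made for illustration; the OculiX code builds the buffer inline rather than through a value class.

```java
import java.nio.ByteBuffer;

final class OplxPayload {
    static final int SIZE = 34;
    final int x, y, width, height, runCount;
    final long timestampMillis;

    OplxPayload(int x, int y, int width, int height, long timestampMillis, int runCount) {
        this.x = x; this.y = y; this.width = width; this.height = height;
        this.timestampMillis = timestampMillis; this.runCount = runCount;
    }

    /** Serializes the fields in table order: magic, version, x, y, w, h, time, count. */
    byte[] encode() {
        return ByteBuffer.allocate(SIZE)
                .put((byte) 'O').put((byte) 'P').put((byte) 'L').put((byte) 0)
                .putShort((short) 1)
                .putInt(x).putInt(y).putInt(width).putInt(height)
                .putLong(timestampMillis)
                .putInt(runCount)
                .array();
    }

    /** Returns null for a short payload, a wrong magic, or an unknown version. */
    static OplxPayload decode(byte[] payload) {
        if (payload == null || payload.length < SIZE) return null;
        ByteBuffer buf = ByteBuffer.wrap(payload);
        if (buf.get() != 'O' || buf.get() != 'P' || buf.get() != 'L' || buf.get() != 0) {
            return null;
        }
        if (buf.getShort() != 1) return null;
        int x = buf.getInt();
        int y = buf.getInt();
        int w = buf.getInt();
        int h = buf.getInt();
        long ts = buf.getLong();
        int runs = buf.getInt();
        return new OplxPayload(x, y, w, h, ts, runs);
    }
}
```

Because the coordinates are signed 32-bit integers, a match on a monitor placed to the left of the primary screen (negative X) round-trips unchanged.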

Design choices in this format

A few decisions deserve commentary.

Big-endian byte order

Non-negotiable. The PNG standard mandates big-endian for all multi-byte integers in chunk fields. Following the same convention inside our payload simplifies parsing and removes confusion with future tooling.

32-bit signed for coordinates

Generous. 16-bit unsigned would have sufficed for single-screen resolutions. We chose 32 bits to leave room for multi-monitor setups where coordinates extend into negative space and tens of thousands of pixels.

64-bit timestamp

Standard UNIX epoch milliseconds. No bytes saved here. Audit trails that span years require room.

Run counter

Allows detecting dead patterns (counter at zero), unstable patterns (high counter on young file), and locked-in patterns (counter grows without timestamp updates).

The chunk is plaintext at this stage. The integrity check is the standard CRC32 that PNG mandates at the end of every chunk; it catches accidental corruption but is not cryptographically strong.

Implementation: three micro-modifications

The actual change inside OculiX comes down to three modifications in two existing files, plus one new utility class. Total addition: roughly 250 lines of Java. No external dependencies. No new Maven coordinates. The standard JDK classes DataInputStream, DataOutputStream, ByteBuffer, and java.util.zip.CRC32 cover all the needs.

1. Image.load() reads the chunk

After ImageIO.read() decodes the pixels, a separate streaming read parses the PNG chunks, locates oPLx, and calls setLastSeen(rect, 1.0) on the current Image instance.

2. doCheckLastSeenAndCreateFinder expands the search

Instead of creating a Region exactly the size of the previous match, it now creates a search box 2.5x larger, clamped to screen bounds. Tolerates UI drift between runs.

3. find() writes the chunk after match

After updating in-memory lastSeen, the chunk in the PNG file is streamed through a temp file and atomic-renamed. The position persists to disk.

Modification 1: Image.load() reads the chunk

In Image.java, the existing load method reads the PNG file from disk into a BufferedImage using ImageIO.read. This call decodes only the image data (the IDAT chunks). It does not parse other chunks. Our addition opens the same file separately, in streaming mode, walks through its chunks until it finds the oPLx chunk if present, parses the position metadata, and calls setLastSeen(rect, 1.0) on the current Image instance.

// In Image.java, after ImageIO.read(fileURL) at line 1018:
try {
    File pngFile = new File(fileURL.toURI());
    byte[] chunk = PngChunk.read(pngFile, "oPLx");
    if (chunk != null && chunk.length >= 34) {
        ByteBuffer buf = ByteBuffer.wrap(chunk); // big-endian by default, like PNG
        byte[] magic = new byte[4];
        buf.get(magic);
        // Accept only the expected magic "OPL\0" and format version 1
        if (magic[0] == 'O' && magic[1] == 'P'
            && magic[2] == 'L' && magic[3] == 0
            && buf.getShort() == 1) {
            int cx = buf.getInt();
            int cy = buf.getInt();
            int cw = buf.getInt();
            int ch = buf.getInt();
            // Ignore degenerate rectangles from a damaged or hand-edited chunk
            if (cw > 0 && ch > 0) {
                setLastSeen(new Rectangle(cx, cy, cw, ch), 1.0);
            }
        }
    }
} catch (Exception ignored) {
    // Best effort: any failure leaves lastSeen null, i.e. the normal full-screen path
}

The streaming parser is deliberately optimized for the common case where the chunk is present and is one of the first non-critical chunks in the file. It skips the PNG signature (8 bytes), then enters a loop: read the chunk length, read the four-byte chunk type, compare it to oPLx, return the payload if matched, return null if IEND is reached, or skip the chunk data and CRC and continue otherwise.
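The loop described above fits in a few lines. This is a sketch under assumptions, not the actual PngChunk source: the class name `PngChunkReadSketch` is hypothetical, it takes an InputStream rather than a File so that it can run without touching disk, and it verifies neither the signature nor the CRC.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class PngChunkReadSketch {

    /** Returns the payload of the first chunk of the wanted type, or null if IEND comes first. */
    static byte[] read(InputStream raw, String wanted) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(raw));
        in.readFully(new byte[8]);                 // skip the 8-byte PNG signature
        byte[] type = new byte[4];
        while (true) {
            int length = in.readInt();             // big-endian payload length
            in.readFully(type);
            String name = new String(type, StandardCharsets.US_ASCII);
            if (name.equals(wanted)) {
                byte[] payload = new byte[length];
                in.readFully(payload);
                return payload;                    // stop early: nothing after this is read
            }
            if (name.equals("IEND")) return null;  // end of file, chunk absent
            skipFully(in, length + 4L);            // skip data plus the 4-byte CRC
        }
    }

    /** skipBytes may skip fewer bytes than asked, so loop until done. */
    static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            int skipped = in.skipBytes((int) Math.min(n, Integer.MAX_VALUE));
            if (skipped <= 0) throw new EOFException("truncated PNG");
            n -= skipped;
        }
    }
}
```

The pixel data in the IDAT chunks is never decoded here; it is only skipped, which is why the extra read stays cheap even on large patterns.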

This is the architectural detail that matters most. The chunk reading is a strict addition that fills a gap in the existing system. It does not replace anything. It does not modify the contract of any existing method. If the chunk is absent (a legacy PNG that has never been processed by OculiX, a PNG whose chunk was stripped by an aggressive optimizer, a PNG generated by an external tool), the code falls through to lastSeen being null, which is exactly the situation the existing codebase has handled for sixteen years. The fall-through path is the cold-start path. It still works. It is just slower.

Modification 2: doCheckLastSeenAndCreateFinder expands the search

In Region.java, the existing method creates a small Region exactly the size of the previous match rectangle. This works well for in-session use, where the image has just been matched at the exact same position. It works less well for cross-process use, where the UI may have drifted slightly between runs.

Our modification expands the search rectangle by a factor of 2.5 around the stored center, clamped to the screen bounds.

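// Replace at line 2891:
Rectangle ls = img.getLastSeen();
// Grow the stored match by 2.5x, capped at the screen size
int sw = Math.min((int) (ls.width * 2.5), screen.w);
int sh = Math.min((int) (ls.height * 2.5), screen.h);

// Center the search box on the center of the previous match
int cx = ls.x + ls.width / 2;
int cy = ls.y + ls.height / 2;
int sx = cx - sw / 2;
int sy = cy - sh / 2;

// Translate to stay inside screen, do not truncate.
// Bounds are taken relative to the screen origin (screen.x, screen.y),
// which is not (0, 0) on secondary monitors.
if (sx < screen.x) sx = screen.x;
if (sy < screen.y) sy = screen.y;
if (sx + sw > screen.x + screen.w) sx = screen.x + screen.w - sw;
if (sy + sh > screen.y + screen.h) sy = screen.y + screen.h - sh;

Region r = Region.create(sx, sy, sw, sh);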

The 2.5 multiplier was chosen empirically:

Multiplier	Effect
1.5x	Occasionally misses drifted patterns
2.0x	Marginal improvement, still occasional misses
2.5x	Sweet spot: tolerates drift without ambiguity
3.0x	No safety improvement, starts introducing ambiguity on dense UIs
4.0x	Multiple visually similar elements get reached, confusing the matcher

The clamping logic preserves the search box size when the pattern is near a screen edge. A pattern at (0, 1022) still gets a full 250 by 145 search box, just positioned at (0, 935) instead of being centered. The match rectangle for the original pattern still fits inside the search box.
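The edge case is easy to verify by hand. The sketch below reproduces the expand-then-translate arithmetic as a standalone function. `SearchBoxSketch` and `searchBox` are hypothetical names, and the example assumes a 100 × 58 pattern, which is the size that a 250 by 145 box implies at a 2.5x factor.

```java
import java.awt.Rectangle;

final class SearchBoxSketch {

    /** Expands lastSeen around its center, then slides the box back inside the screen. */
    static Rectangle searchBox(Rectangle lastSeen, Rectangle screen, double factor) {
        int sw = Math.min((int) (lastSeen.width * factor), screen.width);
        int sh = Math.min((int) (lastSeen.height * factor), screen.height);
        int sx = lastSeen.x + lastSeen.width / 2 - sw / 2;
        int sy = lastSeen.y + lastSeen.height / 2 - sh / 2;
        // Translate rather than crop, so the box keeps its full size at the edges.
        sx = Math.max(screen.x, Math.min(sx, screen.x + screen.width - sw));
        sy = Math.max(screen.y, Math.min(sy, screen.y + screen.height - sh));
        return new Rectangle(sx, sy, sw, sh);
    }

    public static void main(String[] args) {
        Rectangle screen = new Rectangle(0, 0, 1920, 1080);
        // Pattern in the bottom-left corner: unclamped origin would be (-75, 979).
        System.out.println(searchBox(new Rectangle(0, 1022, 100, 58), screen, 2.5));
        // Pattern in the middle of the screen: the box stays centered.
        System.out.println(searchBox(new Rectangle(900, 500, 100, 58), screen, 2.5));
    }
}
```

For the corner pattern the unclamped box would start at (-75, 979) and overflow the bottom of the screen by 44 pixels; translation moves it to (0, 935) with its 250 by 145 size intact.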

Modification 3: find() writes the chunk after a successful match

In Region.java, the existing find method already calls img.setLastSeen(lastMatch.getRect(), lastMatch.getScore()) after a successful match. This updates the in-memory lastSeen field. Our addition extends this call with a write to the PNG file’s oPLx chunk.

// At line 2284 of Region.find(), after setLastSeen:
img.setLastSeen(lastMatch.getRect(), lastMatch.getScore());

// New: persist to PNG chunk
try {
    File pngFile = new File(img.getFileURL().toURI());
    ByteBuffer buf = ByteBuffer.allocate(34);   // fixed 34-byte payload, big-endian
    buf.put((byte) 'O').put((byte) 'P').put((byte) 'L').put((byte) 0); // magic
    buf.putShort((short) 1);                    // format version
    Rectangle r = lastMatch.getRect();
    buf.putInt(r.x).putInt(r.y).putInt(r.width).putInt(r.height);
    buf.putLong(System.currentTimeMillis());    // UNIX epoch milliseconds
    buf.putInt(getRunCount(img) + 1);           // successful matches so far
    PngChunk.write(pngFile, "oPLx", buf.array());
} catch (Exception ignored) {
    // Best effort: a failed write only costs the next cold start a full scan
}

The chunk-writing logic streams through the PNG file once, copying every chunk through to a temporary file, replacing the oPLx chunk in place if it already exists, or inserting a fresh oPLx chunk before the IEND marker if not. At the end, the temporary file replaces the original via an atomic file system rename.

This streaming approach is more complex than reading the whole file into memory, modifying the byte array, and writing the result back, but it is meaningfully more robust:

Property	In-memory	Streaming (chosen)
Memory cost	Proportional to PNG size	Constant (~8 KB buffer)
Crash safety	Risk of partial write	Atomic rename: old or new, never half
Cost on small PNG	< 1 ms	1-2 ms
Cost on large PNG (1 MB+)	20-50 ms + allocation	4-8 ms, no allocation spike
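A stream-and-rename writer along these lines can be sketched as follows. This is an illustration under assumptions rather than the PngChunk source: `PngChunkWriteSketch` is a hypothetical name, it takes a Path instead of a File, and it copies existing chunks byte for byte without re-checking their CRCs.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.CRC32;

final class PngChunkWriteSketch {

    static void write(Path png, String type, byte[] payload) throws IOException {
        // Temp file in the same directory, so the rename stays on one file system.
        Path tmp = Files.createTempFile(png.toAbsolutePath().getParent(), "png-", ".tmp");
        try {
            try (DataInputStream in = new DataInputStream(
                         new BufferedInputStream(Files.newInputStream(png)));
                 DataOutputStream out = new DataOutputStream(
                         new BufferedOutputStream(Files.newOutputStream(tmp)))) {
                rewrite(in, out, type, payload);
            }
            // Replacing an existing target with ATOMIC_MOVE is platform-dependent.
            Files.move(tmp, png, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp);             // no-op after a successful move
        }
    }

    static void rewrite(DataInputStream in, DataOutputStream out, String type, byte[] payload)
            throws IOException {
        byte[] buf = new byte[8192];               // the only buffer, whatever the PNG size
        pump(in, out, 8, buf);                     // copy the signature
        byte[] name = new byte[4];
        boolean written = false;
        while (true) {
            int length = in.readInt();
            in.readFully(name);
            String chunk = new String(name, StandardCharsets.US_ASCII);
            if (chunk.equals(type)) {
                pump(in, null, length + 4L, buf);  // drop the old data and CRC
                if (!written) { emit(out, type, payload); written = true; }
                continue;
            }
            if (chunk.equals("IEND") && !written) emit(out, type, payload);
            out.writeInt(length);
            out.write(name);
            pump(in, out, length + 4L, buf);       // copy data and CRC unchanged
            if (chunk.equals("IEND")) return;
        }
    }

    /** Writes one framed chunk: length, type, payload, CRC32(type + payload). */
    static void emit(DataOutputStream out, String type, byte[] payload) throws IOException {
        byte[] t = type.getBytes(StandardCharsets.US_ASCII);
        CRC32 crc = new CRC32();
        crc.update(t);
        crc.update(payload);
        out.writeInt(payload.length);
        out.write(t);
        out.write(payload);
        out.writeInt((int) crc.getValue());
    }

    /** Moves n bytes from in to out in buffer-sized pieces; a null out discards them. */
    static void pump(DataInputStream in, DataOutputStream out, long n, byte[] buf)
            throws IOException {
        while (n > 0) {
            int k = in.read(buf, 0, (int) Math.min(buf.length, n));
            if (k < 0) throw new EOFException("truncated PNG");
            if (out != null) out.write(buf, 0, k);
            n -= k;
        }
    }
}
```

If the process dies before the move, the original PNG is untouched and only a stray temp file remains. If it dies after, the new file is complete.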

The PngChunk utility class

Reading and writing PNG chunks does not require an external library. The format is simple enough that a 200-line utility class handles both operations with zero dependencies. The class exposes two methods:

public static byte[] read(File png, String type) throws IOException

Returns the payload bytes if the chunk is found, null otherwise. The read uses streaming DataInputStream, with skip() to bypass non-target chunks. Exits as soon as the target chunk is located.

public static void write(File png, String type, byte[] payload) throws IOException

Streams the source PNG to a temp file, replacing or inserting the chunk, then atomic-renames the temp file over the original. CRC32 computed via java.util.zip.CRC32 (hardware-accelerated on modern CPUs).

private static final byte[] SIGNATURE = {
    (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'
};
private static final byte[] IEND_BYTES = { 'I', 'E', 'N', 'D' };

The PNG file signature is 8 well-known bytes. The IEND chunk type terminates every PNG.

The total code in PngChunk.java is 218 lines including comments, blank lines, and the class declaration. Its only imports come from java.io, java.nio, java.nio.charset.StandardCharsets, java.util.Arrays, and java.util.zip.CRC32, all part of the standard JDK.

Benchmark methodology

A benchmark is only useful if its methodology is described in enough detail that someone else can reproduce it.

Hardware and runtime

Item	Value
CPU	Intel Core i3-7100 (2 cores, 4 threads, 3.9 GHz)
RAM	8 GB DDR4-2400
Storage	NVMe SSD
OS	Windows 10 build 19045
Java	OpenJDK 25 from Eclipse Temurin
OculiX	3.0.3 release, feature branch with modifications
Screen	1920 × 1080, no DPI scaling
Pattern	Windows search bar fragment, 100 × 58 pixels, at (0, 1022)

Harness logic

Each benchmark run consisted of fifty independent JVM invocations, each:

Starting a fresh process
Loading OculiX and its native dependencies
Performing exactly one find call against the screen
Printing the elapsed time in milliseconds to standard output
Exiting

The elapsed time was measured using System.nanoTime() immediately before and after the screen.find(pattern) call. This excludes JVM startup, library loading, and framework initialization. It includes only the find operation itself, including the screen capture inside the find.

A separate post-processing class read the fifty timing lines from the captured log and computed: arithmetic mean, median, minimum, maximum, range, and standard deviation.
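For readers who want to reproduce the post-processing, the statistics are short enough to write out. This is a sketch, not the class used for the numbers in this post: `TimingStats` is a hypothetical name, and it assumes the sample standard deviation (n - 1 denominator), since the post does not say which variant was used.

```java
import java.util.Arrays;

final class TimingStats {

    static double mean(double[] v) {
        return Arrays.stream(v).average().orElse(Double.NaN);
    }

    /** Middle value of the sorted data, or the average of the two middle values. */
    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    static double min(double[] v) { return Arrays.stream(v).min().orElse(Double.NaN); }

    static double max(double[] v) { return Arrays.stream(v).max().orElse(Double.NaN); }

    static double range(double[] v) { return max(v) - min(v); }

    /** Sample standard deviation; assumes at least two values. */
    static double stdDev(double[] v) {
        double m = mean(v);
        double sum = 0;
        for (double x : v) sum += (x - m) * (x - m);
        return Math.sqrt(sum / (v.length - 1));
    }

    public static void main(String[] args) {
        // Placeholder timings in milliseconds, not the measured data set.
        double[] ms = {480, 495, 502, 502, 510, 545};
        System.out.printf("mean=%.1f median=%.1f min=%.0f max=%.0f range=%.0f sd=%.1f%n",
                mean(ms), median(ms), min(ms), max(ms), range(ms), stdDev(ms));
    }
}
```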

Two scenarios measured

Scenario	Initial state of PNG	Expected behavior
Baseline	No `oPLx` chunk	All 50 runs in full-screen scan
Optimized	`oPLx` chunk pre-written	All 50 runs in small-region scan

Both scenarios were measured cold-start (50 separate JVM invocations) to ensure the comparison reflects the CI environment.

Results

The numbers are the headline of this post. Let me state them precisely.

Metric	Baseline (FULL)	Optimized (ROI)	Improvement
n	50	50	—
Mean	502 ms	77.7 ms	×6.46
Median	502 ms	77 ms	×6.5
Minimum	480 ms	63 ms	×7.6
Maximum	545 ms	113 ms	×4.8
Range	65 ms	50 ms	comparable
Standard deviation	18 ms	11 ms	tighter

Speedup factor on the find call alone: 6.46.

Observations on these numbers

Low variance in both conditions

Standard deviation is 4% of mean in baseline, 14% in optimized. Neither is noisy enough to require repeated measurement.

Optimized scenario has a floor

Roughly 60-70 ms cannot be reduced further by this technique. Dominated by screen capture (20-40 ms) and OpenCV setup (20-40 ms).

Speedup depends on pattern size

Small patterns (under 50×50 px) give the largest speedup. Large patterns (300×300+ px) give a smaller speedup because the 2.5x search region itself becomes substantial.

Faster hardware preserves ratio

On modern CPUs, absolute timings shrink but the relative speedup factor stays at ×6-8. The overhead floor is proportionally less significant.

Hardware projection

If we extrapolate to other typical hardware:

Hardware	Baseline (FULL)	Optimized (ROI)	Speedup
i3 7th gen (measured)	502 ms	77 ms	×6.5
i7-13700K (projected)	~150-180 ms	~22-25 ms	×8
Apple M2/M3 (projected)	~80-120 ms	~10-15 ms	×8
GitHub Actions standard runner (projected)	~200-250 ms	~28-35 ms	×7

What this means for CI suites

The benchmark measures one find operation in isolation. A real-world test suite performs many operations, each containing at least one find. Translating the per-operation speedup into a suite-level wall-clock improvement requires a few assumptions.

Consider a moderately sized regression suite: 100 test cases, each performing 20 visual interactions on average, for 2000 total find calls.

Scenario	Time per find	Total find time	Savings
Baseline (i3)	500 ms	16 min 40 s	—
Optimized (i3)	77 ms	2 min 34 s	14 min 6 s
Optimized (modern i7)	22 ms	44 seconds	15 min 56 s

Economic projection

If the suite runs once per pull request on a CI runner billed at $0.008 per minute (the rough GitHub Actions standard runner cost):

Activity	Cost per run (baseline)	Cost per run (optimized)	Annual saving (50 PR/day, 250 days)
Suite execution	$0.13	$0.02	~$1 375 per suite

For organizations running multiple suites in parallel across many projects, the saving compounds quickly.
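The projection is plain arithmetic, and writing it out makes the rounding visible. The sketch below uses hypothetical names (`SuiteProjection` and its methods) with the figures from this section: 2000 finds, 500 ms versus 77 ms per find, $0.008 per runner minute, and 50 × 250 runs per year.

```java
final class SuiteProjection {

    /** Total time spent in find, in seconds. */
    static double findSeconds(int findCalls, double msPerFind) {
        return findCalls * msPerFind / 1000.0;
    }

    /** Runner cost of that time at a per-minute rate. */
    static double costPerRun(double seconds, double dollarsPerMinute) {
        return seconds / 60.0 * dollarsPerMinute;
    }

    /** Yearly saving between two per-find timings, for a given number of runs. */
    static double annualSaving(int findCalls, double baseMs, double optMs,
                               double dollarsPerMinute, int runsPerYear) {
        double saved = findSeconds(findCalls, baseMs) - findSeconds(findCalls, optMs);
        return costPerRun(saved, dollarsPerMinute) * runsPerYear;
    }

    public static void main(String[] args) {
        double base = findSeconds(2000, 500);   // 1000 s = 16 min 40 s
        double opt = findSeconds(2000, 77);     // 154 s = 2 min 34 s
        System.out.printf("per run: $%.4f vs $%.4f%n",
                costPerRun(base, 0.008), costPerRun(opt, 0.008));
        System.out.printf("per year: $%.0f%n",
                annualSaving(2000, 500, 77, 0.008, 50 * 250));
    }
}
```

Without intermediate rounding the yearly figure comes out near $1,410. The table's ~$1,375 follows from rounding the per-run costs to $0.13 and $0.02 before subtracting.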

Two caveats

Find is not the only cost

The speedup applies to find time only. The full suite also contains network calls, server processing, browser rendering, mouse and keyboard events, and waits. A suite where finds dominate (visual regression, end-to-end smoke) sees larger wall-clock improvement than a suite waiting on slow back-ends.

Optimization rewards stability

Teams that regenerate patterns frequently see the chunk reset on each regeneration. The first run after regeneration pays the full-screen scan cost. The optimization rewards UI stability and frequent test execution — both characteristics of mature codebases.

Why this survives git clone and CI runners

The structural argument for embedding the position in the PNG itself, rather than in a sidecar file or a central index, is best expressed by considering the CI lifecycle.

A CI build typically starts from a clean state. A fresh container is provisioned, the repository is cloned from origin, dependencies are installed, the build runs, the tests run, the artifacts are collected, and the container is destroyed. There is no persistent state between builds. There is no shared file system. There is no cache that can be relied upon to be hot.

A sidecar file or a central index survives this lifecycle only if every developer remembers to commit it alongside the pattern it describes. The oPLx chunk inside the PNG removes this discipline burden entirely. The chunk is part of the PNG file itself. When a developer runs the test suite locally and the chunk is updated, the modified PNG appears in git status automatically. Git tracks PNG files because they are committed in the project. The developer cannot commit “the update” without committing “the chunk inside the update”, because they are the same file.

This means that when a CI build clones the repository, it gets the PNG file with the chunk already inside. The first test run in CI is already in the fast path. The optimization is active from the first second. No warm-up phase is needed. No cache must be primed. The position metadata is in the file, and the file is in the repository, and the repository is in the clone.

A team running the suite locally, committing the updated PNGs, and pushing to CI is effectively pre-warming the CI cache through normal development activity. There is no separate “cache warm-up” step. The cache warms itself as a side effect of using the tool.

The broader pattern: PNG chunks as runtime context

The technique of embedding metadata in image files is not new. Adobe XMP, EXIF, IPTC, GIMP layer information, and many other systems use this mechanism. What is new, or at least underused, is the embedding of runtime state — not authoring metadata, not provenance information, but actual operational data that evolves as the file is used.

This pattern generalizes beyond visual automation:

Texture caches for graphics applications

A texture file could embed the time it was last loaded, the GPU memory pool it was allocated from, and the average frame time when it was active. A renderer could use this metadata to predict and pre-load textures.

Build artifacts for incremental compilation

A compiled object file could embed the hash of the source it was compiled from, the compiler version, and the optimization level. An incremental build tool could detect when recompilation is actually necessary.

Machine learning datasets

An image used in a vision model training set could embed its last classification result, the confidence score, and the model version. A cleaning tool could identify mislabeled samples without re-running the model.

Audit logs for compliance

A document image stored in a regulated workflow could embed a signed audit trail of the operations performed on it. The Ed25519 signing pattern from the OculiX MCP module would apply directly.

The common thread across all these examples is the same: keep operational state with the artifact, in a format the artifact’s primary tools will preserve, so that the state survives every distribution mechanism the artifact ever encounters.

What this doesn’t solve

A 6.5× speedup on the find is significant but not exhaustive. Several costs remain untouched by this optimization:

Cost component	Status	Approximate weight
Screen capture	Unchanged	20-40 ms per find
OpenCV setup	Unchanged	20-40 ms per find
Tesseract OCR loading	Unchanged	~300 ms per JVM cold start
OculiX framework init	Unchanged	~1.5 seconds per JVM cold start
JVM startup itself	Unchanged	~1-2 seconds per cold start

Reducing these other costs would require deeper structural changes: pre-built native images via GraalVM (the JVM startup), incremental OpenCV initialization (the OpenCV setup), lazy OCR loading (Tesseract), and possibly an alternative screen capture API on each platform. Each of these is a separate optimization project, none of which is in scope for the current change.

Closing: context engineering as a discipline

The lesson of this whole investigation is not “PNG chunks are cool” or “spatial memoization wins”. Both of those are true but uninteresting on their own.

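The lesson is structural: an optimization can be correct in design and still never engage, because the state it depends on does not survive the boundaries of the workflow it runs in.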

Finding these gaps requires a specific habit: benchmarking systems in the configuration users actually experience, not in the configuration that is convenient for developers. In a tight in-process loop, the OculiX find is fast. In a cold-start CI environment, it is slow. The same algorithm, the same code path, the same operating system. The only difference is whether the context built by the previous match has survived to inform the next match.

The fix is rarely a new algorithm. It is usually a new place to store something that already existed, in a form that can travel through the boundaries the user’s actual workflow imposes.

The PNG ancillary chunk happens to be a particularly elegant place to store visual automation context, because the artifact whose context we want to preserve is itself a PNG file, and the standard that defines PNG includes a mechanism specifically designed for this kind of metadata, and the tooling ecosystem has respected that mechanism for thirty years.

The result is an optimization that did not require inventing a new algorithm, did not require breaking any existing contract, did not require any external dependency, and is fully backward compatible with files that do not yet have the chunk. It just needed someone to measure the actual baseline, identify the gap, and close it.

For anyone maintaining a similar tool, my parting suggestion would be: go measure your cold-start performance against your warm-cache performance, in the same configuration your users actually run. If the gap is large, the optimization opportunity is probably not where you think it is. It is probably in the space between two of your existing components, in the form of a state that briefly exists and then disappears.

Repository: github.com/oculix-org/Oculix

Issue tracking the implementation of the persistent locator chunk: oculix-org/Oculix#353