This is a long technical post about a small change. It explains a measurement we never thought to make, the discovery it produced, the implementation we built around it, and what it means for visual automation suites running in CI.
The headline number is real but not particularly dramatic: visual UI matching in OculiX became roughly 6.5 times faster, measured on 50 cold-start JVM runs on an Intel i3 7th generation laptop. The interesting part is not the speedup itself. It is the path that led to it, and the structural lesson that came with it.
If you maintain or use a visual automation framework, a test suite that watches pixels, or any tool that needs to find an image inside another image repeatedly across separate process invocations, the pattern we describe here will probably apply to you. The implementation took about 200 lines of Java, no external dependencies, and three micro-modifications inside the existing codebase. The standard we relied on, PNG, dates from 1996.
What follows is the full story, ordered the way the investigation actually unfolded, not the way a marketing post would tell it.
OculiX is the active continuation of the Sikuli and SikuliX visual automation lineage, MIT-licensed and used in production by close to 100 organizations across banking, defense, healthcare, manufacturing, and retail. Its core operation is a function called find: given a small image (a button, an icon, a region of UI), find it inside the current screen capture and return its coordinates. Every other operation in the public API contains at least one call to find underneath.
Inside the codebase, find is well understood. It uses OpenCV template matching via JNI bindings, with five fallback strategies cascaded inside Finder.java:
Mode 1: Standard match
Exact template matching with the configured similarity threshold. The fast path for stable UI elements.
Mode 2: DPI-aware rescale
If the screen DPI differs from the pattern capture DPI, the template is rescaled before matching.
Mode 3: Tolerant blur
GaussianBlur applied to both source and target. Tolerates antialiasing and subtle color variations.
Mode 4: Grayscale smart
Conversion to grayscale before matching. Tolerates color theme changes.
Mode 5: Multi-scale brute force
Last resort. Tries multiple scales (0.5x to 2x) to catch significantly resized elements.
The code is roughly sixteen years old at this point. It was written by the original Sikuli authors at MIT in 2009, refined incrementally by Raimund Hocke from 2010 to 2025 under SikuliX, and is now maintained by the OculiX maintainers.
Performance, in this kind of codebase, is rarely benchmarked from scratch. Everyone assumes it is whatever it has always been. A find call takes “some milliseconds”, or “a few hundred milliseconds” on a slow machine, and life goes on.
So I sat down on a Sunday afternoon to actually measure it. Not because there was a problem. Because the question of how fast it really was had never been answered with a number on the current hardware I had in front of me.
The setup was deliberately simple. A standalone Java class called FindTiming, compiled directly against the OculiX complete-win jar, performing exactly one find per JVM invocation, then exiting. A batch script wrapping that class in a loop of fifty separate executions.
So fifty cold starts. Each one paid the JVM startup cost, the OpenCV library load, the Tesseract OCR engine load, the OculiX framework initialization, then performed exactly one find and reported the elapsed time before exiting.
The result, on this five-year-old i3 laptop, was extremely consistent:
| Metric | Value |
|---|---|
| Mean | 502 ms |
| Median | 502 ms |
| Minimum | 480 ms |
| Maximum | 545 ms |
| Range | 65 ms |
| Standard deviation | 18 ms |
Half a second to find a small image on a 1920 by 1080 screen. On a modern Intel i7 or Apple Silicon machine the number would be lower, probably by a factor of two or three. On a GitHub Actions standard runner it would be roughly comparable to the i3. On an older corporate desktop running a Citrix client through a VPN it would be slower again.
Take 500 milliseconds and project it across a test suite:
| Suite scale | Find calls | Pure find time @ 500ms |
|---|---|---|
| Small functional suite | 200 | 100 seconds |
| Medium regression suite | 1 000 | 8 min 20 s |
| Large nightly suite | 5 000 | 41 min 40 s |
| Enterprise full coverage | 20 000 | 2 h 46 min |
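The projections above are straight multiplication. The following sketch reproduces them under the same simplification as the table: a flat 500 ms per find, with everything else a test does ignored. `SuiteProjection` is an illustrative name only.

```java
import java.time.Duration;

/** Hypothetical helper: projects pure find time for a suite, assuming a flat cost per call. */
final class SuiteProjection {
    private SuiteProjection() {}

    /** Total find time when every find call costs the same number of milliseconds. */
    static Duration totalFindTime(long findCalls, long msPerFind) {
        return Duration.ofMillis(findCalls * msPerFind);
    }

    /** Formats a duration as "2 h 46 min 40 s", omitting leading zero units. */
    static String format(Duration d) {
        long h = d.toHours();
        int m = d.toMinutesPart();   // Java 9+
        int s = d.toSecondsPart();   // Java 9+
        StringBuilder sb = new StringBuilder();
        if (h > 0) sb.append(h).append(" h ");
        if (h > 0 || m > 0) sb.append(m).append(" min ");
        return sb.append(s).append(" s").toString();
    }
}
```

For example, `format(totalFindTime(20_000, 500))` gives `2 h 46 min 40 s`, which the table shortens to 2 h 46 min.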
Multiply by the number of suites running per day in a CI environment, multiply by the number of CI minutes billed by the runner provider, and the cost becomes very real.
That was the baseline. No optimization had been attempted yet. The number was simply the truth about what happened on the metal.
Before deciding what to optimize, the right move is always to look at what the code already does. OculiX inherits sixteen years of optimization attempts from the Sikuli and SikuliX lineage. A naive optimizer would re-read the OpenCV documentation and propose to switch to a faster matching algorithm. That would be a beginner mistake.
The thing to investigate first is whether there is some optimization the existing code already attempts but cannot fully complete in the current configuration. That is almost always where the gains hide.
A few minutes of grep revealed a field, a setting, and a private method that nobody seems to talk about:
Image.lastSeen
Private Rectangle field on every Image object. Stores the position of the most recent successful match. Paired with getLastSeen() and setLastSeen(rect, score) accessors.
Settings.CheckLastSeen
Public static boolean, set to true by default since at least 2018. Enables the optimization at the framework level.
checkLastSeenAndCreateFinder
Private method in Region.java. When called, creates a Finder restricted to the small rectangle around the previous match, falling back to full-screen only if needed.
The intent of these three pieces is clear once you piece them together. After a successful find, the rectangle of the match is stored in the Image object’s lastSeen field. On the next call to find for the same image, if Settings.CheckLastSeen is true and lastSeen is non-null, the code creates a Finder restricted to that small rectangle and tries to match there first. Only if the small-region match fails does it fall back to a full-screen scan.
This is a classic spatial memoization pattern, well-known in computer vision. Sikuli implemented it correctly, a long time ago.
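The control flow can be sketched without any OculiX dependency. In this sketch, `RegionMatcher` and `LastSeenLookup` are hypothetical names that stand in for the Finder machinery, and plain rectangles stand in for regions. The structure is the one described above: try the remembered rectangle first, fall back to the full screen, and remember whatever matched.

```java
import java.awt.Rectangle;
import java.util.Optional;

/** Hypothetical stand-in for "run template matching inside this rectangle". */
interface RegionMatcher {
    Optional<Rectangle> match(Rectangle searchArea);
}

/** Sketch of the check-last-seen-first strategy; not OculiX's actual code. */
final class LastSeenLookup {
    private Rectangle lastSeen; // lives only in memory, so it is lost when the JVM exits

    LastSeenLookup(Rectangle initial) {
        this.lastSeen = initial;
    }

    Optional<Rectangle> find(RegionMatcher matcher, Rectangle screen, boolean checkLastSeen) {
        if (checkLastSeen && lastSeen != null) {
            Optional<Rectangle> hit = matcher.match(lastSeen); // cheap: only the remembered rectangle
            if (hit.isPresent()) {
                lastSeen = hit.get();
                return hit;
            }
        }
        Optional<Rectangle> hit = matcher.match(screen);       // expensive: the whole screen
        hit.ifPresent(r -> lastSeen = r);
        return hit;
    }

    Rectangle lastSeen() {
        return lastSeen;
    }
}
```

A fresh `LastSeenLookup(null)` behaves exactly like every cold-started JVM in the benchmark: its first call always goes to the full screen.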
The optimization works beautifully inside a single JVM session. If you run a script that performs screen.find("submit_button.png") ten times in a row, the first call pays the full-screen scan cost, but the next nine calls find the image almost instantly through checkLastSeen. The cache hit rate is essentially 100 percent on stable UIs.
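There is, however, a subtle but critical limitation. lastSeen is an ordinary in-memory field. It lives on the Image object, the Image object lives on the JVM heap, and when the process exits the stored position disappears with it. Every new JVM starts with lastSeen set to null.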
This is exactly what our benchmark exposed. Fifty separate JVM invocations, fifty Image objects with lastSeen always equal to null at the moment of the find call, fifty full-screen scans. The checkLastSeen optimization was active and present in the code throughout, but it had no input to work with. The cache was empty because the cache lived inside a process that died right after building it.
This is the core observation. The existing optimization was correct in design. It was simply unable to bridge the gap between JVM invocations. In a typical test environment, where each test is its own process, the optimization never had a chance to engage.
The code that solved the problem was already there, written long before this benchmark was performed, in a careful and well-tested form. The missing piece was not a clever algorithm. It was a way to keep the optimization’s input alive across process boundaries. A storage problem, not a computation problem.
Once the gap was identified, the question became: how do you persist Image.lastSeen between JVM invocations? Several candidate approaches surfaced, each with their own trade-offs.
| Approach | Description | Drawbacks |
|---|---|---|
| Sidecar file | Write foo.png.position next to each PNG | Two files to commit, risk of desynchronization, clutter |
| Central project file | Single .oculix-positions.toml at project root | Linear lookup cost, merge conflicts in parallel CI, full file rewrite per change |
| PNG ancillary chunk | Embed the position metadata inside the PNG itself | Requires writing PNG-aware code, but standard since 1996 |
The PNG ancillary chunk option won, for reasons that became clearer as we explored the constraint of CI environments.
The PNG file format, standardized by the W3C in 1996, is a structured container made of a fixed signature followed by a sequence of chunks. Each chunk has a four-byte length, a four-byte type, a variable-length data payload, and a four-byte CRC32 of the type and data.
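To make the framing concrete, here is a minimal sketch that encodes one chunk using only the JDK. `ChunkEncoder` is an illustrative name and is not part of OculiX.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrative encoder for a single PNG chunk: length, type, data, CRC-32 of type + data. */
final class ChunkEncoder {
    private ChunkEncoder() {}

    static byte[] encode(String type, byte[] data) {
        byte[] typeBytes = type.getBytes(StandardCharsets.US_ASCII);
        if (typeBytes.length != 4) {
            throw new IllegalArgumentException("chunk type must be exactly 4 ASCII letters");
        }
        CRC32 crc = new CRC32();
        crc.update(typeBytes); // the CRC covers the type...
        crc.update(data);      // ...and the data, but not the length field

        ByteBuffer buf = ByteBuffer.allocate(12 + data.length); // ByteBuffer is big-endian by default
        buf.putInt(data.length)
           .put(typeBytes)
           .put(data)
           .putInt((int) crc.getValue());
        return buf.array();
    }
}
```

Encoding an empty IEND chunk this way yields the twelve bytes `00 00 00 00 49 45 4E 44 AE 42 60 82` that close every PNG file.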
The PNG standard distinguishes between critical chunks (IHDR, PLTE, IDAT, and IEND), which a decoder must understand to decode the image, and ancillary chunks, which are optional and can be safely ignored by decoders that do not recognize them.
We chose the type code oPLx for our chunk:
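o: lowercase first letter, ancillary (not critical)
P: uppercase second letter. Under the PNG naming rules this marks a public (registered) chunk; a strictly private chunk would use a lowercase second letter.
L: uppercase third letter, as the reserved bit requires
x: lowercase fourth letter, safe to copy
Decoded: an optional, safe-to-copy chunk identified by oPLx. Decoders do not act on the public/private bit, so the naming choice does not affect reading the file. A PNG tool that does not recognize the chunk ignores it on load. An editor that rewrites the file is permitted to copy it through unchanged, even after modifying the image. Tools that deliberately strip ancillary chunks, such as aggressive optimizers, will still remove it.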
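The four properties are carried by bit 5 (value 0x20) of each ASCII letter in the type code. That bit is set for lowercase letters and clear for uppercase ones. Here is a small sketch that decodes them; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

/** Decodes the property bits encoded in the letter case of a PNG chunk type. */
final class ChunkTypeBits {
    private ChunkTypeBits() {}

    private static boolean lowercase(byte b) {
        return (b & 0x20) != 0; // bit 5 set means lowercase ASCII letter
    }

    private static byte[] bytes(String type) {
        byte[] b = type.getBytes(StandardCharsets.US_ASCII);
        if (b.length != 4) throw new IllegalArgumentException("chunk type must be 4 letters");
        return b;
    }

    static boolean isAncillary(String type)      { return lowercase(bytes(type)[0]); }
    static boolean isPrivate(String type)        { return lowercase(bytes(type)[1]); }
    static boolean reservedBitValid(String type) { return !lowercase(bytes(type)[2]); } // must be uppercase
    static boolean isSafeToCopy(String type)     { return lowercase(bytes(type)[3]); }
}
```

For oPLx it reports an ancillary, public, safe-to-copy chunk with a valid reserved bit. For oplx it would report a private one.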
The internal layout of the oPLx chunk’s data payload is deliberately small. The goal was a fixed-size, fast-to-parse, easy-to-debug binary structure. The total payload is exactly 34 bytes:
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| 0 | 4 | ASCII | Magic OPL\0 | Redundant identifier, prevents misinterpretation |
| 4 | 2 | uint16 (BE) | Version | Currently 1. Bump on breaking format change. |
| 6 | 4 | int32 (BE) | X | Pixel coordinate, last successful match |
| 10 | 4 | int32 (BE) | Y | Pixel coordinate, last successful match |
| 14 | 4 | int32 (BE) | Width | Match rectangle width, pixels |
| 18 | 4 | int32 (BE) | Height | Match rectangle height, pixels |
| 22 | 8 | int64 (BE) | Timestamp | UNIX epoch milliseconds, last update |
| 30 | 4 | int32 (BE) | Run count | Total successful matches since file creation |
Total: 34 bytes of payload, plus the standard 12 bytes of PNG chunk framing (4 bytes length, 4 bytes type, 4 bytes CRC32). Each pattern gains 46 bytes of metadata embedded in its PNG file. For a project with 500 patterns, this represents 23 kilobytes of additional space across the entire pattern library. Effectively negligible.
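Here is a sketch of encoding and decoding that layout with java.nio.ByteBuffer, which is big-endian by default. `LocatorPayload` is a hypothetical class; the snippets later in this post inline the same steps directly in Image.java and Region.java.

```java
import java.awt.Rectangle;
import java.nio.ByteBuffer;

/** Illustrative encoder/decoder for the 34-byte oPLx payload laid out in the table above. */
final class LocatorPayload {
    static final int SIZE = 34;
    static final short VERSION = 1;
    private static final byte[] MAGIC = {'O', 'P', 'L', 0};

    final Rectangle match;
    final long timestampMillis;
    final int runCount;

    LocatorPayload(Rectangle match, long timestampMillis, int runCount) {
        this.match = new Rectangle(match); // defensive copy: Rectangle is mutable
        this.timestampMillis = timestampMillis;
        this.runCount = runCount;
    }

    byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(SIZE); // big-endian, like the rest of the PNG
        buf.put(MAGIC)                              // offset 0
           .putShort(VERSION)                       // offset 4
           .putInt(match.x).putInt(match.y)         // offsets 6, 10
           .putInt(match.width).putInt(match.height) // offsets 14, 18
           .putLong(timestampMillis)                // offset 22
           .putInt(runCount);                       // offset 30
        return buf.array();
    }

    /** Returns null if the payload is too short, has the wrong magic, or an unknown version. */
    static LocatorPayload decode(byte[] data) {
        if (data == null || data.length < SIZE) return null;
        ByteBuffer buf = ByteBuffer.wrap(data);
        for (byte expected : MAGIC) {
            if (buf.get() != expected) return null;
        }
        if (buf.getShort() != VERSION) return null;
        Rectangle r = new Rectangle(buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt());
        long timestamp = buf.getLong();
        int runs = buf.getInt();
        return new LocatorPayload(r, timestamp, runs);
    }
}
```

Returning null on any mismatch mirrors the fall-through behavior described below: a payload the code does not understand is treated as if there were no chunk at all.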
A few decisions deserve commentary.
Big-endian byte order
Non-negotiable. The PNG standard mandates big-endian for all multi-byte integers in chunk fields. Following the same convention inside our payload simplifies parsing and removes confusion with future tooling.
32-bit signed for coordinates
Generous. 16-bit unsigned would have sufficed for single-screen resolutions. We chose 32 bits to leave room for multi-monitor setups where coordinates extend into negative space and tens of thousands of pixels.
64-bit timestamp
Standard UNIX epoch milliseconds. No bytes saved here. Audit trails that span years require room.
Run counter
Allows detecting dead patterns (counter at zero), unstable patterns (high counter on young file), and locked-in patterns (counter grows without timestamp updates).
The chunk is plaintext at this stage. The integrity check is the standard CRC32 that PNG mandates at the end of every chunk; it catches accidental corruption but is not cryptographically strong.
The actual change inside OculiX comes down to three modifications in two existing files, plus one new utility class. Total addition: roughly 250 lines of Java. No external dependencies. No new Maven coordinates. The standard JDK classes DataInputStream, DataOutputStream, ByteBuffer, and java.util.zip.CRC32 cover all the needs.
1. Image.load() reads the chunk
After ImageIO.read() decodes the pixels, a separate streaming read parses the PNG chunks, locates oPLx, and calls setLastSeen(rect, 1.0) on the current Image instance.
2. doCheckLastSeenAndCreateFinder expands the search
Instead of creating a Region exactly the size of the previous match, it now creates a search box 2.5x larger, clamped to screen bounds. Tolerates UI drift between runs.
3. find() writes the chunk after match
After updating in-memory lastSeen, the chunk in the PNG file is streamed through a temp file and atomic-renamed. The position persists to disk.
In Image.java, the existing load method reads the PNG file from disk into a BufferedImage using ImageIO.read. That call returns only the decoded pixels. Whatever else the file carries, including unknown ancillary chunks, is not exposed through the BufferedImage. Our addition opens the same file separately, in streaming mode, and walks through its chunks until it finds the oPLx chunk, if present. It then parses the position metadata and calls setLastSeen(rect, 1.0) on the current Image instance.
```java
// In Image.java, after ImageIO.read(fileURL) at line 1018.
// Requires java.io.File, java.nio.ByteBuffer, and java.awt.Rectangle to be imported in Image.java.
try {
    File pngFile = new File(fileURL.toURI());   // throws for non-file URLs (e.g. inside a jar); caught below
    byte[] chunk = PngChunk.read(pngFile, "oPLx");
    if (chunk != null && chunk.length >= 34) {
        ByteBuffer buf = ByteBuffer.wrap(chunk); // big-endian by default
        byte[] magic = new byte[4];
        buf.get(magic);
        if (magic[0] == 'O' && magic[1] == 'P' && magic[2] == 'L' && magic[3] == 0
                && buf.getShort() == 1) {        // only format version 1 is understood
            int cx = buf.getInt();
            int cy = buf.getInt();
            int cw = buf.getInt();
            int ch = buf.getInt();
            setLastSeen(new Rectangle(cx, cy, cw, ch), 1.0);
        }
    }
} catch (Exception ignored) {
    // Missing, unreadable, or malformed chunk: lastSeen stays null and the cold-start path runs.
}
```

The streaming parser skips the PNG signature (8 bytes), then enters a loop. It reads the chunk length and the four-byte chunk type, then compares the type to oPLx. If they match, it returns the payload. If it reaches IEND, it returns null. Otherwise it skips the chunk data and CRC and continues. Because the write path inserts a new oPLx chunk just before IEND, the reader usually has to walk past the image data first. It skips those payloads rather than reading them, so the work is mostly a handful of 8-byte chunk headers.
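The loop just described fits in a few dozen lines of JDK code. The following is a sketch of what such a reader could look like, not the actual PngChunk.read. In particular, it does not verify the chunk CRC.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch of a streaming chunk reader; not OculiX's actual PngChunk code. */
final class PngChunkReader {
    private static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'};
    private static final byte[] IEND = {'I', 'E', 'N', 'D'};

    private PngChunkReader() {}

    /** Returns the payload of the first chunk of the given type, or null if IEND comes first. */
    static byte[] read(File png, String type) throws IOException {
        byte[] wanted = type.getBytes(StandardCharsets.US_ASCII);
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream(png)))) {
            byte[] sig = new byte[8];
            in.readFully(sig);
            if (!Arrays.equals(sig, SIGNATURE)) throw new IOException("not a PNG file");

            byte[] chunkType = new byte[4];
            while (true) {
                int length = in.readInt();          // DataInputStream reads big-endian, as PNG requires
                if (length < 0) throw new IOException("invalid chunk length");
                in.readFully(chunkType);
                if (Arrays.equals(chunkType, wanted)) {
                    byte[] data = new byte[length];
                    in.readFully(data);             // the trailing CRC is not verified in this sketch
                    return data;
                }
                if (Arrays.equals(chunkType, IEND)) return null;
                skipFully(in, length + 4L);         // skip payload + CRC without keeping them
            }
        }
    }

    private static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) throw new EOFException("truncated chunk");
            n -= skipped;
        }
    }
}
```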
This is the architectural detail that matters most. The chunk reading is a strict addition that fills a gap in the existing system. It does not replace anything. It does not modify the contract of any existing method. If the chunk is absent (a legacy PNG that has never been processed by OculiX, a PNG whose chunk was stripped by an aggressive optimizer, a PNG generated by an external tool), the code falls through to lastSeen being null, which is exactly the situation the existing codebase has handled for sixteen years. The fall-through path is the cold-start path. It still works. It is just slower.
In Region.java, the existing method creates a small Region exactly the size of the previous match rectangle. This works well for in-session use, where the image has just been matched at the exact same position. It works less well for cross-process use, where the UI may have drifted slightly between runs.
Our modification expands the search rectangle by a factor of 2.5 around the stored center, clamped to the screen bounds.
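```java
// Replace at line 2891.
// screen is the Screen (a Region) being searched. Its x/y origin can be non-zero, or even
// negative, on multi-monitor setups, so clamp against its full bounds rather than (0, 0).
Rectangle ls = img.getLastSeen();
int sw = Math.min((int) (ls.width * 2.5), screen.w);
int sh = Math.min((int) (ls.height * 2.5), screen.h);

// Center the enlarged box on the previous match.
int cx = ls.x + ls.width / 2;
int cy = ls.y + ls.height / 2;
int sx = cx - sw / 2;
int sy = cy - sh / 2;

// Translate to stay inside the screen; never truncate the box.
if (sx < screen.x) sx = screen.x;
if (sy < screen.y) sy = screen.y;
if (sx + sw > screen.x + screen.w) sx = screen.x + screen.w - sw;
if (sy + sh > screen.y + screen.h) sy = screen.y + screen.h - sh;

Region r = Region.create(sx, sy, sw, sh);
```

The 2.5 multiplier was chosen empirically: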
| Multiplier | Effect |
|---|---|
| 1.5x | Occasionally misses drifted patterns |
| 2.0x | Marginal improvement, still occasional misses |
| 2.5x | Sweet spot: tolerates drift without ambiguity |
| 3.0x | No safety improvement, starts introducing ambiguity on dense UIs |
| 4.0x | Multiple visually similar elements get reached, confusing the matcher |
The clamping logic preserves the search box size when the pattern is near a screen edge. A pattern at (0, 1022) still gets a full 250 by 145 search box, just positioned at (0, 935) instead of being centered. The match rectangle for the original pattern still fits inside the search box.
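The worked example in the previous paragraph can be checked with a self-contained version of the same arithmetic. This sketch assumes the match rectangle is 100 × 58 pixels, which is what a 250 by 145 box at a 2.5x factor implies. `SearchBox.around` is a hypothetical name.

```java
import java.awt.Rectangle;

/** Illustrative expanded search box: factor x the last match, translated to stay on screen. */
final class SearchBox {
    private SearchBox() {}

    static Rectangle around(Rectangle last, Rectangle screen, double factor) {
        int w = Math.min((int) (last.width * factor), screen.width);
        int h = Math.min((int) (last.height * factor), screen.height);
        int x = last.x + last.width / 2 - w / 2;   // center on the previous match
        int y = last.y + last.height / 2 - h / 2;
        // Translate, never shrink; since w <= screen.width the upper bound is never below screen.x.
        x = Math.max(screen.x, Math.min(x, screen.x + screen.width - w));
        y = Math.max(screen.y, Math.min(y, screen.y + screen.height - h));
        return new Rectangle(x, y, w, h);
    }
}
```

For a 100 × 58 match at (0, 1022) on a 1920 × 1080 screen, this returns a 250 × 145 box at (0, 935). That box still fully contains the original match.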
In Region.java, the existing find method already calls img.setLastSeen(lastMatch.getRect(), lastMatch.getScore()) after a successful match. This updates the in-memory lastSeen field. Our addition extends this call with a write to the PNG file’s oPLx chunk.
```java
// At line 2284 of Region.find(), the existing in-memory update:
img.setLastSeen(lastMatch.getRect(), lastMatch.getScore());

// New: persist the same position to the PNG's oPLx chunk.
try {
    File pngFile = new File(img.getFileURL().toURI());
    ByteBuffer buf = ByteBuffer.allocate(34);                           // big-endian by default
    buf.put((byte) 'O').put((byte) 'P').put((byte) 'L').put((byte) 0);  // magic "OPL\0"
    buf.putShort((short) 1);                                            // format version
    Rectangle r = lastMatch.getRect();
    buf.putInt(r.x).putInt(r.y).putInt(r.width).putInt(r.height);
    buf.putLong(System.currentTimeMillis());                            // last update, epoch ms
    buf.putInt(getRunCount(img) + 1);   // getRunCount: helper returning the previous counter (not shown)
    PngChunk.write(pngFile, "oPLx", buf.array());
} catch (Exception ignored) {
    // Best effort: failing to persist the position must never fail the find itself.
}
```

The chunk-writing logic streams through the PNG file once, copying every chunk to a temporary file. If an oPLx chunk already exists, it is replaced in place; otherwise a fresh oPLx chunk is inserted before the IEND marker. At the end, the temporary file replaces the original via an atomic file system rename.
This streaming approach is more complex than reading the whole file into memory, modifying the byte array, and writing the result back, but it is meaningfully more robust:
| Property | In-memory | Streaming (chosen) |
|---|---|---|
| Memory cost | Proportional to PNG size | Constant (~8 KB buffer) |
| Crash safety | Risk of partial write | Atomic rename: old or new, never half |
| Cost on small PNG | < 1 ms | 1-2 ms |
| Cost on large PNG (1 MB+) | 20-50 ms + allocation | 4-8 ms, no allocation spike |
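Here is a sketch of the streaming replace-or-insert write, using only JDK classes. `PngChunkWriter` is an illustrative name, not OculiX's PngChunk. The sketch assumes the temp file can be created in the pattern's own directory, so the final rename stays on one file system.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Sketch of a streaming chunk writer: replace or insert, then atomic rename. */
final class PngChunkWriter {
    private static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'};
    private static final byte[] IEND = {'I', 'E', 'N', 'D'};

    private PngChunkWriter() {}

    static void write(File png, String type, byte[] payload) throws IOException {
        byte[] wanted = type.getBytes(StandardCharsets.US_ASCII);
        Path source = png.toPath().toAbsolutePath();
        Path tmp = Files.createTempFile(source.getParent(), png.getName(), ".tmp"); // same directory
        try {
            try (DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(source)));
                 DataOutputStream out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(tmp)))) {
                byte[] sig = new byte[8];
                in.readFully(sig);
                if (!Arrays.equals(sig, SIGNATURE)) throw new IOException("not a PNG file");
                out.write(sig);

                byte[] chunkType = new byte[4];
                boolean written = false;
                while (true) {
                    int length = in.readInt();
                    if (length < 0) throw new IOException("invalid chunk length");
                    in.readFully(chunkType);
                    if (Arrays.equals(chunkType, wanted)) {
                        transfer(in, null, length + 4L);      // drop the old payload and CRC...
                        if (!written) {
                            writeChunk(out, wanted, payload); // ...and write the new chunk in its place
                            written = true;
                        }
                        continue;
                    }
                    boolean isEnd = Arrays.equals(chunkType, IEND);
                    if (isEnd && !written) {
                        writeChunk(out, wanted, payload);     // no existing chunk: insert before IEND
                        written = true;
                    }
                    out.writeInt(length);                     // copy every other chunk unchanged
                    out.write(chunkType);
                    transfer(in, out, length + 4L);           // payload + original CRC
                    if (isEnd) break;
                }
            }
            // Readers see the old file or the new one, never a half-written file.
            Files.move(tmp, source, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }

    private static void writeChunk(DataOutputStream out, byte[] type, byte[] data) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(type);
        crc.update(data);
        out.writeInt(data.length);
        out.write(type);
        out.write(data);
        out.writeInt((int) crc.getValue());
    }

    /** Copies exactly n bytes from in to out, or discards them when out is null. */
    private static void transfer(InputStream in, OutputStream out, long n) throws IOException {
        byte[] buf = new byte[8192];
        while (n > 0) {
            int r = in.read(buf, 0, (int) Math.min(buf.length, n));
            if (r < 0) throw new EOFException("truncated chunk");
            if (out != null) out.write(buf, 0, r);
            n -= r;
        }
    }
}
```

The Files.move contract leaves it implementation-specific whether ATOMIC_MOVE replaces an existing target. In practice the JDK does replace it on Linux and Windows. Where it does not, the move throws an IOException and the original PNG is left untouched.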
Reading and writing PNG chunks does not require an external library. The format is simple enough that a 200-line utility class handles both operations with zero dependencies. The class exposes two methods:
public static byte[] read(File png, String type) throws IOException
Returns the payload bytes if the chunk is found, null otherwise. The read uses a streaming DataInputStream, with skip() to bypass non-target chunks, and exits as soon as the target chunk is located.
public static void write(File png, String type, byte[] payload) throws IOException
Streams the source PNG to a temp file, replacing or inserting the chunk, then atomic-renames the temp file over the original. The CRC32 is computed via java.util.zip.CRC32 (hardware-accelerated on modern CPUs).
```java
private static final byte[] SIGNATURE = {
    (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'
};
private static final byte[] IEND_BYTES = { 'I', 'E', 'N', 'D' };
```

The PNG file signature is 8 well-known bytes. The IEND chunk type terminates every PNG.
The total code in PngChunk.java is 218 lines including comments, blank lines, and the class declaration. The file has no imports outside the java.io, java.nio, java.util.Arrays, java.util.zip.CRC32, and java.nio.charset.StandardCharsets packages, all standard JDK.
A benchmark is only useful if its methodology is described in enough detail that someone else can reproduce it.
| Item | Value |
|---|---|
| CPU | Intel Core i3-7100 (2 cores, 4 threads, 3.9 GHz) |
| RAM | 8 GB DDR4-2400 |
| Storage | NVMe SSD |
| OS | Windows 10 build 19045 |
| Java | OpenJDK 25 from Eclipse Temurin |
| OculiX | 3.0.3 release, feature branch with modifications |
| Screen | 1920 × 1080, no DPI scaling |
| Pattern | Windows search bar fragment, 12 × 58 pixels, at (0, 1022) |
Each benchmark run consisted of fifty independent JVM invocations. Each invocation started a fresh JVM, loaded OpenCV, the Tesseract OCR engine, and the OculiX framework, performed exactly one find call against the screen, printed the elapsed time, and exited.
The elapsed time was measured using System.nanoTime() immediately before and after the screen.find(pattern) call. This excludes JVM startup, library loading, and framework initialization. It covers only the find operation itself, including the screen capture inside the find.
A separate post-processing class read the fifty timing lines from the captured log and computed: arithmetic mean, median, minimum, maximum, range, and standard deviation.
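The post-processing step is a few lines of arithmetic. This sketch assumes the timings arrive as whole milliseconds. The text does not say whether the reported standard deviation is the population or the sample estimate, so this version uses the population formula; `TimingStats` is an illustrative name.

```java
import java.util.Arrays;

/** Illustrative summary statistics over per-run find timings in milliseconds. */
final class TimingStats {
    final double mean, median, stdDev;
    final long min, max, range;

    TimingStats(long[] samplesMs) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        long[] s = samplesMs.clone();
        Arrays.sort(s);
        int n = s.length;
        min = s[0];
        max = s[n - 1];
        range = max - min;
        median = (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
        double sum = 0;
        for (long v : s) sum += v;
        mean = sum / n;
        double squares = 0;
        for (long v : s) squares += (v - mean) * (v - mean);
        stdDev = Math.sqrt(squares / n); // population SD; divide by (n - 1) for the sample estimate
    }
}
```

With fifty samples, the two standard-deviation conventions differ by about 1%, which does not change any conclusion drawn from the tables.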
| Scenario | Initial state of PNG | Expected behavior |
|---|---|---|
| Baseline | No oPLx chunk | All 50 runs in full-screen scan |
| Optimized | oPLx chunk pre-written | All 50 runs in small-region scan |
Both scenarios were measured cold-start (50 separate JVM invocations) to ensure the comparison reflects the CI environment.
The numbers are the headline of this post. Let me state them precisely.
| Metric | Baseline (FULL) | Optimized (ROI) | Improvement |
|---|---|---|---|
| n | 50 | 50 | — |
| Mean | 502 ms | 77.7 ms | ×6.46 |
| Median | 502 ms | 77 ms | ×6.5 |
| Minimum | 480 ms | 63 ms | ×7.6 |
| Maximum | 545 ms | 113 ms | ×4.8 |
| Range | 65 ms | 50 ms | comparable |
| Standard deviation | 18 ms | 11 ms | tighter |
Speedup factor on the find call alone: 6.46.
Low variance in both conditions
Standard deviation is 4% of mean in baseline, 14% in optimized. Neither is noisy enough to require repeated measurement.
Optimized scenario has a floor
Roughly 60-70 ms cannot be reduced further by this technique. Dominated by screen capture (20-40 ms) and OpenCV setup (20-40 ms).
Speedup depends on pattern size
Small patterns (under 50×50 px) give the largest speedup. Large patterns (300×300+ px) give a smaller speedup because the 2.5x search region itself becomes substantial.
Faster hardware preserves ratio
On modern CPUs, absolute timings shrink. We expect, but have not measured, that the relative speedup factor stays around ×6-8, with the overhead floor proportionally less significant.
If we extrapolate to other typical hardware:
| Hardware | Baseline (FULL) | Optimized (ROI) | Speedup |
|---|---|---|---|
| i3 7th gen (measured) | 502 ms | 77 ms | ×6.5 |
| i7-13700K (projected) | ~150-180 ms | ~22-25 ms | ~×7 |
| Apple M2/M3 (projected) | ~80-120 ms | ~10-15 ms | ×8 |
| GitHub Actions standard runner (projected) | ~200-250 ms | ~28-35 ms | ×7 |
The benchmark measures one find operation in isolation. A real-world test suite performs many operations, each containing at least one find. Translating the per-operation speedup into a suite-level wall-clock improvement requires a few assumptions.
Consider a moderately sized regression suite: 100 test cases, each performing 20 visual interactions on average, for 2000 total find calls.
| Scenario | Time per find | Total find time | Savings vs. i3 baseline |
|---|---|---|---|
| Baseline (i3) | 500 ms | 16 min 40 s | — |
| Optimized (i3) | 77 ms | 2 min 34 s | 14 min 6 s |
| Optimized (modern i7) | 22 ms | 44 seconds | 15 min 56 s |
If the suite runs once per pull request on a CI runner billed at $0.008 per minute (the rough GitHub Actions standard runner cost):
| Activity | Cost per run (baseline) | Cost per run (optimized) | Annual saving (50 PR/day, 250 days) |
|---|---|---|---|
| Suite execution | $0.13 | $0.02 | ~$1 375 per suite |
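The cost rows follow from minutes of find time multiplied by the per-minute rate. This sketch uses the same assumptions as the table: 2,000 finds per run, $0.008 per minute, and 50 runs a day for 250 days. `CiCost` is an illustrative name.

```java
/** Hypothetical back-of-the-envelope model of CI cost spent on find calls. */
final class CiCost {
    private CiCost() {}

    /** Dollars spent on find calls in one suite run. */
    static double perRun(int findCalls, double msPerFind, double dollarsPerMinute) {
        double minutes = findCalls * msPerFind / 60_000.0;
        return minutes * dollarsPerMinute;
    }

    /** Yearly saving from lowering the per-find time. */
    static double annualSaving(int findCalls, double baselineMs, double optimizedMs,
                               double dollarsPerMinute, int runsPerYear) {
        return runsPerYear * (perRun(findCalls, baselineMs, dollarsPerMinute)
                            - perRun(findCalls, optimizedMs, dollarsPerMinute));
    }
}
```

Without rounding the per-run costs first, `annualSaving(2000, 500, 77, 0.008, 12_500)` comes to about $1,410. The table's ~$1 375 follows from the rounded $0.13 and $0.02 per-run figures.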
For organizations running multiple suites in parallel across many projects, the saving compounds quickly.
Find is not the only cost
The speedup applies to find time only. The full suite also contains network calls, server processing, browser rendering, mouse and keyboard events, and waits. A suite where finds dominate (visual regression, end-to-end smoke) sees larger wall-clock improvement than a suite waiting on slow back-ends.
Optimization rewards stability
Teams that regenerate patterns frequently see the chunk reset on each regeneration. The first run after regeneration pays the full-screen scan cost. The optimization rewards UI stability and frequent test execution — both characteristics of mature codebases.
The structural argument for embedding the position in the PNG itself, rather than in a sidecar file or a central index, is best expressed by considering the CI lifecycle.
A CI build typically starts from a clean state. A fresh container is provisioned, the repository is cloned from origin, dependencies are installed, the build runs, the tests run, the artifacts are collected, and the container is destroyed. There is no persistent state between builds. There is no shared file system. There is no cache that can be relied upon to be hot.
A sidecar file or a central position index would only reach CI if developers remembered to commit it and kept it in sync with the patterns. The oPLx chunk inside the PNG removes this discipline burden entirely. The chunk is part of the PNG file itself. When a developer runs the test suite locally and the chunk is updated, the modified PNG appears in git status automatically, because the pattern PNGs are already tracked in the repository. The developer cannot commit “the update” without committing “the chunk inside the update”, because they are the same file.
This means that when a CI build clones the repository, it gets the PNG file with the chunk already inside. The first test run in CI is already in the fast path. The optimization is active from the first second. No warm-up phase is needed. No cache must be primed. The position metadata is in the file, and the file is in the repository, and the repository is in the clone.
A team running the suite locally, committing the regenerated PNGs, and pushing to CI is effectively pre-warming the CI cache through normal development activity. There is no separate “cache warm-up” step. The cache warms itself as a side effect of using the tool.
The technique of embedding metadata in image files is not new. Adobe XMP, EXIF, IPTC, GIMP layer information, and many other systems use this mechanism. What is new, or at least underused, is the embedding of runtime state — not authoring metadata, not provenance information, but actual operational data that evolves as the file is used.
This pattern generalizes beyond visual automation:
Texture caches for graphics applications
A texture file could embed the time it was last loaded, the GPU memory pool it was allocated from, and the average frame time when it was active. A renderer could use this metadata to predict and pre-load textures.
Build artifacts for incremental compilation
A compiled object file could embed the hash of the source it was compiled from, the compiler version, and the optimization level. An incremental build tool could detect when recompilation is actually necessary.
Machine learning datasets
An image used in a vision model training set could embed its last classification result, the confidence score, and the model version. A cleaning tool could identify mislabeled samples without re-running the model.
Audit logs for compliance
A document image stored in a regulated workflow could embed a signed audit trail of the operations performed on it. The Ed25519 signing pattern from the OculiX MCP module would apply directly.
The common thread across all these examples is the same: keep operational state with the artifact, in a format the artifact’s primary tools will preserve, so that the state survives every distribution mechanism the artifact ever encounters.
A 6.5× speedup on the find is significant but not exhaustive. Several costs remain untouched by this optimization:
| Cost component | Status | Approximate weight |
|---|---|---|
| Screen capture | Unchanged | 20-40 ms per find |
| OpenCV setup | Unchanged | 20-40 ms per find |
| Tesseract OCR loading | Unchanged | ~300 ms per JVM cold start |
| OculiX framework init | Unchanged | ~1.5 seconds per JVM cold start |
| JVM startup itself | Unchanged | ~1-2 seconds per cold start |
Reducing these other costs would require deeper structural changes: pre-built native images via GraalVM (the JVM startup), incremental OpenCV initialization (the OpenCV setup), lazy OCR loading (Tesseract), and possibly an alternative screen capture API on each platform. Each of these is a separate optimization project, none of which is in scope for the current change.
The lesson of this whole investigation is not “PNG chunks are cool” or “spatial memoization wins”. Both of those are true but uninteresting on their own.
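The lesson is structural. An optimization can be correct, enabled, and present in the code, and still never engage, because the state it depends on does not survive a boundary that the user's real workflow crosses.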
Finding these gaps requires a specific habit: benchmarking systems in the configuration users actually experience, not in the configuration that is convenient for developers. In a tight in-process loop, the OculiX find is fast. In a cold-start CI environment, it is slow. The same algorithm, the same code path, the same operating system. The only difference is whether the context built by the previous match has survived to inform the next match.
The fix is rarely a new algorithm. It is usually a new place to store something that already existed, in a form that can travel through the boundaries the user’s actual workflow imposes.
The PNG ancillary chunk happens to be a particularly elegant place to store visual automation context, because the artifact whose context we want to preserve is itself a PNG file, and the standard that defines PNG includes a mechanism specifically designed for this kind of metadata, and the tooling ecosystem has respected that mechanism for thirty years.
The result is an optimization that did not require inventing a new algorithm, did not require breaking any existing contract, did not require any external dependency, and is fully backward compatible with files that do not yet have the chunk. It just needed someone to measure the actual baseline, identify the gap, and close it.
For anyone maintaining a similar tool, my parting suggestion would be: go measure your cold-start performance against your warm-cache performance, in the same configuration your users actually run. If the gap is large, the optimization opportunity is probably not where you think it is. It is probably in the space between two of your existing components, in the form of a state that briefly exists and then disappears.
Repository: github.com/oculix-org/Oculix
Issue tracking the implementation of the persistent locator chunk: oculix-org/Oculix#353