yosinago 3 days ago [-]
Hey everyone,
I recently wanted to see how far I could push the JVM on a classic I/O problem: parsing a 1-million line log file to extract error counts per hour.
I started with the most naive, readable Spring-tutorial style approach (Files.readAllLines + Stream API). The baseline: 872ms and 19 GC pauses.
I decided to rewrite it step-by-step to understand the exact bottlenecks. In my final iteration (V04), I brought the execution time down to 78ms with zero GC allocations, running on a single thread (Intel Core i5).
How I did it:
Off-Heap Memory: I dropped String allocations entirely and used the FFM API (MemorySegment), memory-mapping the file directly via FileChannel.
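A minimal sketch of that mapping step (assuming Java 22+, where the FFM API is final; the class and method names here are illustrative, not the repo's):

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LogMapper {
    // Map the whole file into off-heap memory. The parser then walks the
    // MemorySegment by raw byte offset, never allocating a String per line.
    static MemorySegment mapReadOnly(Path path, Arena arena) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping stays valid until `arena` is closed; no heap copy is made.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".log");
        Files.writeString(p, "ERROR 2024-01-01T10:15 boom\n");
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = mapReadOnly(p, arena);
            System.out.println(seg.byteSize()); // prints the file length in bytes
        }
    }
}
```

A confined arena ties the mapping's lifetime to one thread, which fits the single-threaded design.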
SWAR (SIMD Within A Register): Instead of reading byte-by-byte, I used ValueLayout.JAVA_LONG_UNALIGNED to load 8 raw bytes per read. Since log lines are variable length, the loads aren't aligned, but modern x86 handles unaligned access beautifully.
Bitwise Operations: I used bitmasks to locate the \n character across all 8 byte lanes simultaneously, and Long.numberOfTrailingZeros to pinpoint the exact offset.
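The two steps above can be sketched like this, using the classic SWAR zero-byte trick: XOR turns every '\n' lane into 0x00, and (x - 0x01…01) & ~x & 0x80…80 sets the high bit of the lowest zero lane. This assumes a little-endian host (as on the x86 CPU mentioned above); names are illustrative, not the repo's:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class SwarScan {
    private static final long NL = 0x0A0A0A0A0A0A0A0AL; // '\n' in all 8 lanes

    // Returns the byte index (0-7) of the first '\n' in the 8 bytes at
    // `offset`, or -1 if none of the eight lanes contains a newline.
    static int firstNewline(MemorySegment seg, long offset) {
        long word = seg.get(ValueLayout.JAVA_LONG_UNALIGNED, offset); // 8 bytes, any alignment
        long x = word ^ NL;                                           // '\n' lanes become 0x00
        long hits = (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
        return hits == 0 ? -1 : Long.numberOfTrailingZeros(hits) >>> 3;
    }

    public static void main(String[] args) {
        MemorySegment seg = MemorySegment.ofArray("ab\ncdefghij".getBytes());
        System.out.println(firstNewline(seg, 0)); // prints 2: newline at byte index 2
    }
}
```

Dividing the trailing-zero count by 8 (the `>>> 3`) converts the bit position of the flagged high bit back into a byte-lane index.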
BCE (Bounds Check Elimination): By extracting the hour as an integer and indexing a 32-slot table with a bitmask (hour & 0x1F), the JIT compiler could prove the index is always within [0, 31], completely eliminating the array bounds check.
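An illustrative sketch of that masked-index trick (my own minimal version, not the repo's code): because the table has exactly 32 slots and the index is masked with 0x1F, the JIT can see the index can never exceed the array length. Slots 24-31 are padding, since valid hours are 0-23:

```java
class HourCounts {
    // 32 slots so that (hour & 0x1F) is provably a valid index.
    static final long[] COUNTS = new long[32];

    static void record(int hour) {
        COUNTS[hour & 0x1F]++; // index provably in [0, 31]: no range check emitted
    }

    public static void main(String[] args) {
        record(13);
        record(13);
        System.out.println(COUNTS[13]); // prints 2
    }
}
```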
It was an amazing journey into mechanical sympathy and understanding what Java does under the hood.
If you're interested in performance engineering, I documented the entire journey (from V01 to V04, isolating each bottleneck and explaining the JVM/hardware impact) in the repository.
I'd love to hear your thoughts or if you see any further micro-optimizations I might have missed!