Davide Angelocola

The Slow Fix

30 April 2026

This is a success story from when I was working as a consultant in the mid 2010s. This is an extract from notes I took at that time. The inspiration is “The Phoenix Project”¹ — minus the novel format. The goal here is simpler: document what happened, in case it’s useful to someone standing at the same starting point.

Context

The project had been handed over to two developers with some knowledge transfer. The original authors were gone. Senior developers at the company knew about it and kept their distance — the project had a reputation. What remained was 250,000 lines of Java, a 90 MB WAR file — a Java web application packaged for deployment into a servlet container, Apache Tomcat in this case, running a custom-packaged distribution — and a production system going down roughly once a day — from unsynchronized access to shared mutable state, connection leaks exhausting the database pool, and out-of-memory errors from unbounded in-memory accumulation. Apache Cocoon, Apache Struts, Spring MVC, Hibernate, Drools, jBPM — a decade of framework choices stacked on top of each other. Getting it running locally took me almost a week. The developers had learned, through experience, that the safest move was to touch as little as possible.

team: when I joined the team, there were two backend developers and one frontend developer. No QA, no sysadmins. The team wasn’t incompetent — they were paralyzed by a system that punished curiosity. The problems weren’t just technical: we had committed to features with delivery dates. Every stabilization effort had to compete against other priorities — normal in itself, but hard to navigate when production went down daily.

database: the database was MySQL, with a mix of MyISAM² and InnoDB³ tables. MyISAM does not support transactions and any operation that touched both table types had no atomicity guarantee — a failure mid-write could leave data partially committed with no rollback possible. This had gone unaddressed long enough that compensating logic had accumulated throughout the codebase.

too much magic: a significant portion of the business logic layer was built on Java reflection and dynamic proxies. The code was hard to follow statically and harder to debug at runtime. Proxied objects masked their actual types; reflective calls bypassed IDE navigation and static analysis. Bugs in this layer produced failures with no obvious connection to the triggering code.

undocumented internal libraries: the system relied on several in-house libraries with no documentation and no original authors left to ask. The libraries covered areas that standard frameworks already handled — serialization, HTTP communication, data transformation — but with custom behavior that deviated in undocumented ways. Every interaction with them required reverse-engineering from usage in the codebase. We had custom libraries for logging, JDBC utils, a couple of SQL DSLs, XML, JS/CSS minification, etc. In general, the code had no documentation / javadoc.

insecure and unreliable processes passwords were stored in plain text, protected by a custom security framework that was itself undocumented. Releases were done manually — copy-paste SQL patches directly into the database console, deploy and hope nothing breaks, and hope the same patch hadn’t already been applied in a previous release. The build process required tribal knowledge that lived only in people’s heads, and those people had left. Unit tests existed in the Maven configuration but were disabled. There was no CI, no integration tests, no structured logging — the codebase mixed Logback configuration with a custom Logger wrapper class, and errors surfaced through ex.printStackTrace() and also via email, for specific errors. Empty catch blocks were everywhere. Apache Maven and Ant JARs had somehow ended up on the production classpath. Onboarding a new developer meant days of undocumented setup rituals.

Not everything was broken. The team was already on AWS as early customers of EC2 and S3, which gave us flexibility without needing physical infrastructure. SVN was in use with sane branching defaults — at least history was preserved. There was a backup/restore culture, which meant the database wasn’t a gamble.

Months 0–3: First steps

The first months were the hardest. Daily outages meant the team was in constant firefighting mode — no space to think, no focus, no sense of direction.

At this point we had two environments: production and pre-production. Pre-production received fixes with a copy of the previous day’s production data — which meant hotfixes and new-feature testing could not happen simultaneously without manually syncing the databases. One EC2 instance for the web app, a primary MySQL instance with daily backups in S3, and a MySQL replica for data-warehouse queries.

The first concrete change was removing Struts and consolidating on Spring MVC. This was already underway before I joined the team and it was in good shape. Not because Struts was the biggest problem — it wasn’t — but because having two web frameworks in the same application was unnecessary complexity with no payoff. Incremental changes from the start: no big-bang refactorings, no feature freezes.

Months 4–6: Hard choices

With a baseline of instability, we started clearing the underbrush. Unused classes, JAR conflicts, disabled tests. The JAR hell was particularly bad — Apache Maven and Ant artifacts on the production classpath, multiple copies of the same dependency under different group IDs, version conflicts surfacing as runtime errors with no clear cause. We untangled it incrementally, release by release, asking always: “Why do we need to keep this dependency?” The rule of thumb: only touch what we can prove correct, and only around features we were already changing in that release.

Logging moved from printStackTrace and email alerts to SLF4J with Logback: this was mostly grep-and-replace work across the codebase. Empty catch blocks were everywhere: many silently ignored exceptions, with extra code paths added throughout to compensate for or hide the real issue.

The build became a single command: mvn package, with unit tests re-enabled to cover the new parts of the system.

The data-warehouse job, which ran weekly and reliably died with out-of-memory errors, got its first real attention. It had been built to simulate materialized views in MySQL: the primary node accepted OLTP reads and writes, while a replica received binary log changes and served read-only queries. Hibernate, combined with some reflection hacks and connection-pooling issues, made it unstable — it frequently failed, and we had to manually restart it the following day. The fix was simple: track which entities changed during the OLTP workload in a separate table, then process those changes asynchronously with eventual consistency. Instead of refreshing all entities at midnight, the system updated only those that had actually changed — an incremental materialized view. This made the process reliable and kept warehouse data current. It also freed the team to focus on other problems.

Other minor (as effort) but important fixes:

addressing some thread safety issues by protecting shared mutable parts with synchronized blocks — one of the primary drivers of the daily outages.
improved backup/restore tooling — a single script using SSH tunnel to avoid intermediate copies; this was the same script used to restore the database in AWS;
migrated the collation of all tables to utf8;

This was the first moment the system became visibly more stable. The team could focus more on new features.

Months 7–9: Infra as code

Infrastructure started getting attention. In my previous consultant work, I was exposed to Puppet for on-premises infrastructure. I proposed it in a dev meeting for AWS EC2 and switched to RPM builds via a Maven plugin, replacing the manual deployment rituals with something repeatable (the release was built on a developer machine, hoping it had everything committed). After a few days of setup, we could create a new environment with a single puppet apply, pulling all packages — JRE, Apache Tomcat + custom jars, sshd with our SSH public keys, and all changes in /etc. It was a good moment for the team: everyone became a sysop. We also started deploying smaller releases more often — instead of once every 3 months, we shipped every 3 weeks.

A UAT (User Acceptance Test) environment came online — for the first time, there was a place to verify changes without blocking pre-production, usually reserved for hot fixes.

Drools, a rules engine used for a small part of the business logic, was removed and replaced with simple Java validation logic. It was pulling a large number of extra JARs — Eclipse JDT, ANTLR, ASM, protobuf, xstream, commons-* and more. It had likely been intended for broader use, but the team had no need for it beyond the narrow case. The weight was not justified — a clear tension with the previous refactoring direction. Until that point we were replacing custom libs with external libs, but in this case the logic was trivial conditions:

rule "No Gold Customers"
when
    // This matches if there is NO Customer object with status "Gold"
    not Customer( status == "Gold" )
then
   errors.add("Invalid status");
end

The application produced documents for Word/Excel using OpenOffice and ODT templates⁴: the OpenOffice process was unstable due to memory leaks inside the process itself, which couldn’t be fixed directly. What worked was:

restarting the service daily (the leak was small — a few KB per invocation)
but the real fix was starting OpenOffice per request: slower per request but far more stable operationally.

We also started forward-compatibility work for Java 8, already mainstream elsewhere but not yet adopted here. The migration ran in two phases: first, use Java 8 as the compiler target; then migrate the codebase to embrace new language features — notably lambdas and the new date/time API (the codebase still used Calendar and Date). The second phase ran in parallel with other ongoing work.

The milestone that stood out most: the team started treating bugs as opportunities to understand the system rather than fires to extinguish. The rule was: bug → failing test case → fix → deploy to UAT. We started modifying core parts of the system with less fear. The boy scout rule — leave the code better than you found it — had become a team habit. At that point, leading by example had proven to be the most effective lever.

Months 10–12: More confidence

We started to run.

Our efforts started to pay dividends, along with a migration from SVN to Git. Both changed how the team worked more than any code change had. Jenkins meant every commit was verified automatically; Git meant branching was cheap enough to use. To enable continuous integration, we started deploying the master branch to UAT daily — possible thanks to Jenkins, RPMs, and Puppet.

Flyway was introduced to manage database migrations — before this, schema changes were applied by hand with no versioning, which meant different environments could silently diverge. MyISAM tables in MySQL were converted to InnoDB, restoring transactional guarantees that had been missing. This allowed us to catch database migration issues early.

I remember setting up a Jenkins job to gamify the Java 8 migration: every morning we tracked how many files remained to migrate. We dropped a custom library that emulated lambdas using bytecode generation at runtime — Lambdaj — and all uses of the Joda-Time library (this caused a few easy-to-fix regressions, but the team started using Java 8 in production — a big selling point internally).

All new code required unit tests — not as a rule handed down, but as a shared expectation the team had started to own.

Stateful DAOs — a pattern that had caused unit-of-work problems throughout the codebase — were systematically removed. The DAOs were duplicating what the Hibernate Session already provides: identity map, dirty tracking, first-level cache. Except they did it with bugs. The fix was to delete the custom state management and rely on the Session directly.

Months 13–18: Team maturity

The infrastructure side caught up with the application side. CentOS 7 replaced CentOS 6; systemctl replaced the handwritten bash scripts managing services. We switched to the upstream Apache Tomcat 7 package, dropping the custom setup we inherited.

We introduced Sonar and IntelliJ IDEA inspections for static analysis; the number of subtle bugs they surfaced was striking.

At this point, the database was still holding passwords in cleartext — we had an internal discussion about using Apache Shiro but ultimately settled on Spring Security (the project was already using Spring and most developers were comfortable with it). A few hours later, passwords were stored as BCrypt⁵ with a lazy migration pattern: the system could read both formats, and as users authenticated, their passwords were migrated on the fly.

Hibernate was upgraded from 3.2 to 4. The migration was tricky: the codebase had custom code hooking into Hibernate callbacks to handle dynamic proxies and entitlements, and JAR conflicts kept the upgrade stuck. We moved incrementally: 3.2 → 3.3 → 3.4 → latest 3.x → 4.x. The jump from 3.x to 4.x produced one rollback — our custom lifecycle listeners for dynamic proxies silently stopped firing, causing entitlement checks to pass unconditionally in UAT. Integration tests caught it before production. We extracted the listener logic into an explicit wrapper, re-attempted, and moved on. At the end, the code around that layer was cleaner and more explicit than anything it replaced — a good case for bottom-up design.

By this point, the business had started to see the benefits of our invisible work. They began onboarding more internal users — and with that came more opportunities for the team. Since the system was growing in usage, we added a second application node to share the load. Puppet made this straightforward. Hazelcast came in for distributed locking across the cluster and as a second-level cache for Hibernate. Adding a distributed in-memory cluster was a deliberate reversal of our complexity-reduction principle — but the alternative, coordinating locks through the database, was slower and more fragile at the write volumes we were seeing. We treated it as a bounded trade-off: one new component with a clear scope, not a new platform.

                       ┌─────────────────┐
                       │  Apache HTTPD   │
                       │  (load balancer)│
                       └────────┬────────┘
                                │
                   ┌────────────┴────────────┐
                   │                         │
            ┌──────▼──────┐           ┌──────▼──────┐
            │  App Node 1 │           │  App Node 2 │
            │   Tomcat    ├───────────┤   Tomcat    │
            │             │ Hazelcast │             │
            │  · L2 cache │  cluster  │  · L2 cache │
            │  · dist lock│           │  · dist lock│
            │  · sessions │           │  · sessions │
            └──────┬──────┘           └──────┬──────┘
                   │                         │
                   └───────────┬─────────────┘
                               │
                  ┌────────────┴────────────┐
                  │                         │
           ┌──────▼──────┐          ┌───────▼──────┐
           │   MySQL     │          │    MySQL      │
           │   Primary   │─binlog──►│   Replica     │
           │  (R/W OLTP) │          │  (read-only)  │
           └─────────────┘          └──────────────┘

Around this time, a frontend developer left — taking with him some knowledge about his custom solution for minifying JS and CSS files in Java. The process had caused problems in the past: it was a manual step that had to be triggered every time a JS or CSS file changed, and it reliably broke in UAT even when it worked locally. I proposed switching to wro4j to build minified assets automatically during the Maven build, making it impossible to forget (it took a few hours of work and a couple of bad errors caught in UAT before full stability).

Load testing ran for the first time, giving numbers instead of guesses when discussing performance. We used JMeter to simulate load and surface issues.

We introduced JavaMelody⁶ for runtime monitoring to give some visibility on what was going on. Some metrics were also exported to AWS CloudWatch using a setup configured in Puppet.

Another important point was publishing AWS metrics on a dashboard to check various parts of the system, using some Python scripts and AWS CloudWatch.

Months 19–24: Stability

We were deploying several times a week. By this point the system looked almost nothing like what we had inherited. The WAR file was down to 60 MB. The codebase was at 180,000 lines — 70,000 fewer than when we started, despite three years of new features. There were over 200 database migrations in Flyway, every schema change tracked and repeatable. Production outages had been absent for months.

The continuous-improvement cycle was fully in place. Integration tests and automated acceptance tests replaced a purely manual QA process. jBPM, the workflow engine, was removed — its use case didn’t require it. A straightforward state machine, written from scratch and fully covered by integration tests, replaced it with a fraction of the complexity.

After an incident dealing with expired SSL certificates, I proposed to use Let's Encrypt (at the time a very young company) certificates to replace manual certificate management, which had been adding overhead as the number of environments grew — by this point we had production, pre-production, UAT, and CI.

At the time we completed the “Excel over HTTP” integration: the Apache POI version was holding the entire Excel file in memory, causing out-of-memory errors. After an internal discussion, I proposed avoiding Apache POI for this part and directly manipulating the XML inside the file (XLSX is a ZIP archive of XML files, so direct manipulation was straightforward) — this was faster and consumed less memory. The trade-off was another small internal library, but this time it was clearly documented why it was needed and all the options considered at that time and covered by unit and integration tests.

Months 24+

Around that time, I left the company to move to another city. The team had grown to 5 backend developers, 2 frontend developers, and 2 QAs — still no sysadmins. And it was in good shape: confident, autonomous, and shipping regularly. What I found on day one and what I left behind were barely recognizable as the same team culture.

This wasn’t my first lead role. But in retrospect, it was one of the most satisfying — not because of the technical work, but because of watching the team grow. They went from being afraid to touch anything to taking ownership of the system, making decisions independently, and treating problems as something to solve rather than something to survive.

Tactical and Strategic

Early on, one of the developers proposed a full rewrite. The reasoning was understandable — the codebase was complex and undocumented, the framework stack was a decade old, and the system was going down every day. Starting clean felt like the obvious answer.

It wasn’t. Joel Spolsky called it “the single worst strategic mistake that any software company can make”⁷ — and the reasoning holds: the old system contains years of accumulated domain knowledge. Bugs that turned into features. Edge cases silently handled. Compensations for upstream failures. Throwing it away means losing all of that, and you won’t know what you lost until it’s missing in production.

Instead, the approach was cultural before technical. Empower the developers to make changes. Replace fear with a process: if something breaks, understand why, fix it, and share what you learned. Every bug is an opportunity to understand the system better, not a sign that someone should have been more careful.⁸ Over time, this shift mattered more than any individual refactoring.

On the technical side: fail fast on broken invariants, add post-condition checks, and when in doubt, do minimal design and wait to commit for more complex architectures. Complexity was already the enemy — every change that added more of it made the next change harder. A practical heuristic emerged early: prioritize what I call mechanical refactorings — changes trivial to prove correct that move the code in the right direction, scoped to what we were already touching in that release. Safe, bounded, and compounding over time.

Tactical fixes buy time and reduce pain immediately. Strategic bets change the system’s trajectory — they compound over time and enable things that were previously impossible. The key distinction: tactical work has an immediate, visible payoff. Strategic work often looks like overhead until suddenly it doesn’t. The judgment isn’t “fix the most painful thing.” It’s “fix the thing that opens the next door.”

The clearest examples from this project:

Fixing daily outages was the obvious urgent priority. The real strategic win was that the team stopped being afraid to touch the system.
A single mvn package command was a convenience fix. A few months later, it made CI possible.
A UAT environment looked like a nice-to-have. It was the precondition for deploying multiple times per week to iterate quickly on new features (without blocking pre-prod).
Flyway migrations eliminated an operational annoyance. They also made database divergence between environments structurally impossible.
Puppet provisioning seemed like infrastructure housekeeping. When we needed to scale, it was a non-event.
The Java 8 migration looked like a compiler upgrade. It eliminated some extra libraries and aligned the codebase with the ecosystem, unlocking more library upgrades. It also highlighted hidden areas of code that triggered more cleanup work.

There’s a useful negative example too. At some point the manager asked what it would cost to migrate to PostgreSQL — a reasonable question, given Oracle’s acquisition of MySQL and the uncertainty it raised about licensing. The call was to defer it. Not because it was wrong in principle, but because the preconditions weren’t there: no reproducible migrations, no stable environments, no test coverage to verify behavior across a different database. It would have been high-cost, high-risk work with questionable strategic payoff given where the system was. Sequencing matters as much as choosing.

Recommended techniques

These are the techniques that enabled us to slowly fix an unstable system — I still use them today:

Make it easy to onboard new developers — document it in the repository, remove manual steps and listen to new joiners, ask them “are we crazy doing this like that?”.

Bugs as opportunities — reframe team culture: no drama; fix, learn, share, and move on.

Discuss and share principles — a guiding principle shapes how the team approaches the system. A good example⁹: “Make it so every subsystem can be found and repaired manually, even if you need to crawl to reach it.” Or something like these rules¹⁰.

Incremental refactoring — no big-bang rewrites. If a refactoring is too large to merge incrementally, split it or put it behind a feature flag. Tidy the code first then change it.

Test before replace — before removing or replacing a component, write tests that capture its behavior. The tests become the specification for the replacement.

Write good tests — tests that only verify happy paths, or that duplicate implementation rather than testing behavior should be avoided. Use mocks to test hard to reproduce conditions like “email send failed” or “S3 upload failed”.

Segregate subsystems — keep transactions, logging, retry logic, and background jobs independent. Circular dependencies between subsystems make changes expensive.

Minimal design — throw away what isn’t needed and don’t over generalize. The best code is the code that doesn’t exist.

Be careful with build vs buy: decide if a specific library is a core competitive advantage that must be built and maintained in house. Also document why the decision was made.

Automate deploy and provisiong — manual steps are where environments diverge and where outages start: database migrations, infrastructure provisioning, secrets management, etc.

Read: Michael Feathers, Working Effectively with Legacy Code — the practical manual for everything described here. If you’re inheriting a large codebase without tests, start there.

By the Numbers

	Start	End
Production outages	2-3/week	0 for months
Deploy frequency	~1/quarter	Several/week
Codebase	250,000 LOC	180,000 LOC
WAR size	90 MB	60 MB
DB migrations	0 tracked	200+ in Flyway
Environments	2 (pre-prod, production)	4 (CI, UAT, pre-prod, production)
Language	Java 5	Java 8
Onboarding time	~ 1 week	few minutes, git clone + database restore

Gene Kim, Kevin Behr, George Spafford — The Phoenix Project (2013). ↩
MyISAM — the original MySQL storage engine; no support for transactions, foreign keys, or row-level locking. ↩
InnoDB — the transactional MySQL storage engine; supports ACID transactions, row-level locking, and foreign keys. ↩
ODT (Open Document Text) — the XML-based document format used by OpenOffice and LibreOffice. ↩
Spring Security BCrypt — https://docs.spring.io/spring-security/site/docs/current/api/org/springframework/security/crypto/bcrypt/BCrypt.html ↩
a monitoring tool for Java / Java EE, source ↩
Joel Spolsky, Things You Should Never Do, Part I (2000). ↩
“Every bug is an opportunity” — a personal motto that stuck. Developers who have worked with me since will recognize it. The phrase does real work: it reframes a problem as a chance to learn, and that shift in mindset changes how a team responds under pressure. ↩
Picard Engineering Tips (@PicardTips on X) — fictional engineering advice in the voice of Jean-Luc Picard. https://x.com/PicardTips ↩
“18 Subtle Rules of Software Engineering”, source ↩