From legacy Joomla to static HTML in a day.

Published on Apr 14, 2026

Back in the day, I had the pleasure of building a wildly popular online art magazine with a few friends. Even now, you can still find links to it in hundreds of professional books and essays. The project came to an end about 12 years ago, but I’ve kept it running ever since—mostly to avoid breaking those backlinks that still carry weight.

The problem was that it was built on Joomla. As the years went by, it became an easy target for hackers. I got hit a few times and ended up wasting a lot of time trying to patch and secure an outdated system with known vulnerabilities.

Over the last couple of years, things got worse. Bots were hammering the site daily, creating unnecessary load, and I finally reached a point where enough was enough. Shutting it down completely wasn’t really an option, though.

Then a simple idea came to mind: why not just convert everything to plain HTML? The site doesn’t need updates—we just need to preserve the URLs and keep the content accessible.

With the help of Claude Code, I managed to pull off in a few hours of hands-on work, spread over a couple of days of trial and error, something that would have felt nearly impossible a few years ago. Back then, I probably would've just taken everything offline to save myself the headache.

I figured this was worth sharing: it's a pretty specific scenario, but I imagine others might run into the same problem.

And as usual, I learned a lot along the way.

Freezing a 10-year-old Joomla site into static HTML, locally, with Docker

A practical field report on archiving an ancient, PHP-5.6-era Joomla site into a static mirror — without touching production. Every step below came from a real afternoon of debugging, not a tidy plan.

The problem

We had a Joomla site (PHP 5.6, MySQL 5.x, K2 component) that needed to be preserved as static HTML. Not rebuilt, not migrated — just frozen into a set of .html files that can be served cheaply and forever without a database.

The plan sounded simple:

  1. Point wget --mirror at the site.
  2. Deploy the resulting static files.
  3. Kill the PHP/MySQL stack.

We tried it on production first. It hit three walls:

  • It took forever (hours) and risked rate-limiting.
  • It generated unnecessary CPU load on an old server already struggling.
  • Any interruption meant starting over.

Solution: run the site locally in Docker, mirror it there at LAN speed, then deploy only the static output. Sounds easy. Wasn't.

Setting up a PHP 5.6 + MySQL 5.6 stack on an Apple Silicon Mac in 2026

Both php:5.6-apache and mysql:5.6 images exist on Docker Hub but have no ARM builds. On an M2 Mac you must force platform: linux/amd64, which runs them under Rosetta. It works, just slower.

Minimal docker-compose.yml:

services:
  web:
    build: .
    platform: linux/amd64
    ports:
      - "8080:80"
    volumes:
      - joomla_files:/var/www/html
    depends_on:
      - db

  db:
    image: mysql:5.6
    platform: linux/amd64
    environment:
      MYSQL_ROOT_PASSWORD: root
      MYSQL_DATABASE: artmag_XXXX
      MYSQL_USER: artmag_XXX
      MYSQL_PASSWORD: 'uyNjq.jKA?_XXXXX'
    volumes:
      - db_data:/var/lib/mysql
      - ./db-init:/docker-entrypoint-initdb.d
    ports:
      - "3307:3306"
    command: >
      --character-set-server=utf8
      --collation-server=utf8_general_ci
      --max_allowed_packet=512M
      --innodb_log_file_size=256M
      --wait_timeout=600
      --net_read_timeout=600

volumes:
  db_data:
  joomla_files:

Tip: match the original DB name, user, and password from configuration.php. It saves you from editing most of that file and reduces the chance of a typo breaking everything.
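If you want a quick look at what those values are, the relevant lines are easy to pull out of the original file (a small sketch; the property names are the standard Joomla ones, adjust if your version differs):

```shell
# Show the DB-related settings in a Joomla configuration.php.
grep -E 'public \$(host|db|user|password) =' configuration.php
```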

The Dockerfile that wouldn't build

php:5.6-apache is built on Debian Stretch — EOL since 2022. The standard repos are gone. First attempt at apt-get install:

W: GPG error: http://deb.debian.org/debian stretch Release:
   The following signatures were invalid: EXPKEXXXX ...
E: Unable to locate package libpng-dev

Two fixes are needed:

  1. Redirect APT to the archive (archive.debian.org).
  2. Bypass the expired GPG keys — you can't renew Debian's expired signing keys, so you have to tell APT "yes, I know, proceed anyway."

Working Dockerfile:

FROM php:5.6-apache

RUN sed -i 's|deb.debian.org|archive.debian.org|g; \
            s|security.debian.org|archive.debian.org|g; \
            /stretch-updates/d' /etc/apt/sources.list \
    && echo 'Acquire::Check-Valid-Until "false";' \
       > /etc/apt/apt.conf.d/99no-check-valid-until

RUN apt-get -o Acquire::AllowInsecureRepositories=true update \
    && apt-get install -y --allow-unauthenticated \
        libpng-dev libjpeg-dev libfreetype6-dev \
        libzip-dev zlib1g-dev libxml2-dev libicu-dev \
    && docker-php-ext-configure gd \
        --with-freetype-dir=/usr/include/ \
        --with-jpeg-dir=/usr/include/ \
    && docker-php-ext-install -j$(nproc) \
        gd mysqli pdo pdo_mysql zip mbstring xml intl \
    && a2enmod rewrite \
    && rm -rf /var/lib/apt/lists/*

RUN printf "upload_max_filesize = 64M\npost_max_size = 64M\n\
memory_limit = 256M\nmax_execution_time = 300\n" \
    > /usr/local/etc/php/conf.d/joomla.ini

That is the minimum you need for a Joomla site with K2, GD-powered image thumbnails, and mod_rewrite SEF URLs.

macOS APFS vs. legacy filename encodings

Extracting the tarball into a bind-mounted folder failed hard:

tar: Can't create 'images/\215�\215�\216_\216_...jpg': Illegal byte sequence

The site had thousands of images uploaded over a decade with Greek filenames in legacy encodings (ISO-8859-7 / Mac Roman). macOS's APFS refuses any filename whose bytes aren't valid UTF-8. No tar flag, no locale override, nothing will make this work on an APFS bind mount.

The fix is to not let macOS see those filenames at all. Store the web root in a Docker-managed named volume (backed by the Linux VM's ext4 filesystem) instead of a bind mount:

volumes:
  - joomla_files:/var/www/html   # named volume — Linux ext4, accepts any bytes

Then extract inside the container:

docker compose cp public_html.tar.gz web:/tmp/
docker compose exec web bash -c \
  "cd /var/www/html && tar -xzf /tmp/public_html.tar.gz --strip-components=1 \
   && rm /tmp/public_html.tar.gz \
   && chown -R www-data:www-data /var/www/html"

The tradeoff: you lose Finder-level access to the files. For editing configuration.php you either docker compose exec web sed ... or docker cp the file out, edit, and docker cp it back.

The docker compose down -v trap

At one point, to reset the DB and re-import the dump, I ran:

docker compose down -v

This wipes every named volume in the project — including the one we just extracted 660 MB of Joomla files into. Had to re-extract from scratch.

The safe way to reset only the DB:

docker compose rm -sf db
docker volume rm joomla-legacy_db_data
docker compose up -d db

Scope the destruction to what you actually want to destroy.

MySQL packet size: importing a 660 MB dump

The dump imported 4,756 lines successfully and then:

ERROR 2006 (HY000) at line 4757: MySQL server has gone away

Classic max_allowed_packet limit. A single INSERT statement in the dump exceeded MySQL's default 4 MB packet size and the server closed the connection. Bump it (visible in the compose snippet above):

--max_allowed_packet=512M
--innodb_log_file_size=256M
--wait_timeout=600
--net_read_timeout=600

Under amd64 emulation on an M2, the full 660 MB import took about 8 minutes with zero errors.

DNS cache and container recreation

After recreating the db container (same name, new IP), PHP couldn't resolve it:

php_network_getaddresses: getaddrinfo failed:
Name or service not known

The web container had cached the old IP from the now-dead db container. docker compose restart web fixes it. Remember: if you recreate one service, restart its consumers too.

Patching configuration.php for the container environment

docker compose exec -T web bash -c "cd /var/www/html && \
  sed -i \"s|public \\\$host = 'localhost';|public \\\$host = 'db';|\" configuration.php && \
  sed -i \"s|public \\\$live_site = '';|public \\\$live_site = 'http://localhost:8080';|\" configuration.php && \
  sed -i \"s|public \\\$log_path = '.*';|public \\\$log_path = '/var/www/html/logs';|\" configuration.php && \
  sed -i \"s|public \\\$tmp_path = '.*';|public \\\$tmp_path = '/var/www/html/tmp';|\" configuration.php"

The key variable is $live_site. Set it to http://localhost:8080 so Joomla generates absolute URLs that match how wget is going to request them. If you leave it empty, some plugins and RSS feeds will emit URLs pointing to the production hostname; wget won't rewrite those, so your mirror ends up littered with links back to the live server.

The .htaccess bot-blocker that made wget return 403

Ran wget. Got a 403 instantly. But curl returned 200. What?

$ docker compose exec -T web grep -i "wget\|user.agent" /var/www/html/.htaccess
SetEnvIfNoCase User-Agent "wget" bad_user
SetEnvIfNoCase User-Agent "crawler" bad_user
SetEnvIfNoCase User-Agent "spider" bad_user
...

The site's production .htaccess — which we helpfully brought along with the files — blocks the literal string wget in the User-Agent header. Locally, this is nonsense. The fix is to either edit .htaccess or spoof the UA. I picked the spoof because it's less invasive:

--user-agent='Mozilla/5.0 Chrome/120 Safari/537'

That's enough. The .htaccess rule does substring matching, so any UA that doesn't contain "wget" passes.
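A rough shell analogue of what SetEnvIfNoCase is doing here (grep -qi stands in for Apache's case-insensitive substring match; the UA strings are just examples):

```shell
# Mimic the .htaccess rule: any UA containing "wget" (case-insensitive) is blocked.
for ua in 'Wget/1.21.4' 'Mozilla/5.0 Chrome/120 Safari/537'; do
  if printf '%s' "$ua" | grep -qi 'wget'; then
    echo "$ua -> blocked (403)"
  else
    echo "$ua -> allowed (200)"
  fi
done
```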

The production-appropriate wget command, and what I stripped from it

My original command (written for the live site) was:

wget --mirror --convert-links --adjust-extension --page-requisites \
     --no-parent --continue --tries=3 --timeout=30 \
     --reject-regex "itemlist/tag" \
     -P ~/Desktop/mysite \
     --user-agent="Mozilla/5.0 ... Chrome/120 ..." \
     --wait=1 --random-wait \
     https://www.xxxxxx.gr/

For localhost, most of that is wrong or wasteful:

  • --wait=1 --random-wait (removed): politeness for public sites; on localhost it just wastes your time.
  • --continue (removed): redundant with --mirror and actually incompatible in some cases.
  • --tries=3 --timeout=30 (kept): still useful if PHP hits an error on a specific page.
  • --user-agent=... (kept): needed to bypass the .htaccess wget block.
  • --adjust-extension (kept): saves files as .html — self-describing on any static host.
  • --no-parent (kept): a defense against accidental off-site crawls.
  • --execute robots=off (added): the site's robots.txt is production SEO policy, not a mirror plan.

And the --reject-regex grew a lot. For a Joomla/K2 site, you want to skip:

administrator                       — admin panel (useless, login only)
component/users                     — login/register/profile/auth forms
component/mailto                    — "email this article" forms (can send mail!)
component/finder                    — search indexer
component/banners                   — click-through counters (state-modifying)
task=user.(login|logout|register|remind|reset)  — auth actions
tmpl=component                      — stripped popup copy of every page
format=(pdf|feed|raw)               — alternate renderings (CPU-heavy PDF, duplicate RSS)
print=1                             — print views (duplicates)
itemlist/tag                        — tag archives (crawl trap)
start=[0-9]{4,}                     — pagination spam beyond page ~1000

Final command (as a shell script to avoid terminal paste issues with quotes):

#!/bin/bash
wget --mirror \
     --convert-links \
     --adjust-extension \
     --page-requisites \
     --no-parent \
     --tries=3 \
     --timeout=30 \
     --execute robots=off \
     --user-agent='Mozilla/5.0 Chrome/120 Safari/537' \
     --reject-regex '(administrator|/component/(users|mailto|finder|banners)|task=user\.(login|logout|register|remind|reset)|tmpl=component|format=(pdf|feed|raw)|print=1|itemlist/tag|start=[0-9]{4,})' \
     -P ~/Desktop/mysite \
     http://localhost:8080/
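Before committing to a long crawl, it's worth sanity-checking the reject pattern against a few sample URLs. wget's --reject-regex uses POSIX extended regex by default, which grep -E also speaks, so a quick loop is enough (the URLs below are made-up examples):

```shell
REJECT='(administrator|/component/(users|mailto|finder|banners)|task=user\.(login|logout|register|remind|reset)|tmpl=component|format=(pdf|feed|raw)|print=1|itemlist/tag|start=[0-9]{4,})'

# Expect: REJECT, REJECT, REJECT, KEEP
for url in '/administrator/index.php' '/some-article?print=1' \
           '/itemlist/tag/art' '/about-us'; do
  if printf '%s' "$url" | grep -qE "$REJECT"; then
    echo "REJECT $url"
  else
    echo "KEEP   $url"
  fi
done
```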

Verifying the mirror isn't missing anything

After 12 seconds of link-conversion at the end, wget reported "Converted links in 7793 files." How do we know this is actually complete?

The site had no published sitemap, so we compared the database to the file system directly:

-- count of publicly published content
SELECT COUNT(*) FROM xjhze_k2_items WHERE published=1 AND trash=0;
-- → 7,462

Extract every {id}-{alias} from the DB, extract the same from filenames in the mirror, comm -23 them:

# DB side
docker compose exec -T db mysql -sN artmag_XXXX \
  -e "SELECT CONCAT(id,'-',alias) FROM xjhze_k2_items \
      WHERE published=1 AND trash=0;" | sort -u > all_k2.txt

# Filesystem side
find . -path '*/item/*.html' | sed 's|.*/||; s|\.html$||' | sort -u > mirror_k2.txt

comm -23 all_k2.txt mirror_k2.txt > missing.txt
wc -l missing.txt
# → 2,573 missing

That number looked scary until we checked whether those "missing" URLs were actually publicly reachable on Joomla itself. We sampled 20 of them and hit each via the raw K2 URL on the running container:

curl -sI "http://localhost:8080/index.php?option=com_k2&view=item&id=2688&Itemid=1"
# → HTTP/1.1 404 Not Found

All 20 returned 404 from Joomla itself. They are orphaned DB records — content assigned to categories that have no menu item, so the router has no way to resolve them. They are not reachable by any real user, any bot, or any search engine link.

So the mirror is complete for every URL that actually resolves. Google can't have indexed URLs that 404. That's a good reminder: "content in database" is not the same as "content on site."

The extensionless URL problem

Joomla SEF URLs look like /some-article (no extension). wget with --adjust-extension saves them as some-article.html. If Google has https://example.com/some-article indexed and you deploy your mirror to a dumb static server, that URL 404s.

The right fix is on the server, not in wget. Put this .htaccess in the mirror root:

RewriteEngine On
DirectoryIndex index.html

# If the exact request doesn't exist as a file or directory,
# try appending .html
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+?)/?$ $1.html [L]

# Normalize trailing slashes to match original Joomla SEF
RewriteCond %{REQUEST_URI} ^(.+)/$
RewriteRule ^ %1 [R=301,L]

Now /some-article, /some-article.html, and /some-article/ all serve the same some-article.html — and the browser URL stays extensionless, matching what Google indexed.
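If the mirror ends up behind nginx instead of Apache, the usual equivalent is try_files (a sketch using nginx's standard directive, not taken from this project's config):

```nginx
location / {
    # Serve /some-article from some-article.html; fall back to directory index.
    try_files $uri $uri.html $uri/index.html =404;
}
```

This serves the extensionless URL directly without a redirect; trailing-slash normalization would need a separate rewrite if you want exact parity with the Apache rules above.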

Rewriting absolute URLs in the mirror

A Joomla site writes absolute URLs into many places: canonical <link> tags, og:image meta, RSS feed links, and any <a> tag in an article body where the author pasted a fully-qualified URL. After mirroring, 7,752 HTML files contained https://www.xxxxx.gr/... (fine when viewing the mirror at that same domain) and 7,753 files contained http://localhost:8080/... (because we set $live_site to that — fatal when deploying to production).

Both need to be rewritten to root-relative paths. macOS BSD sed dies on non-UTF8 bytes in Greek content — use perl instead:

cd ~/Desktop/mysite/localhost:8080

# strip production domain
find . -type f \( -name '*.html' -o -name '*.htm' \) -print0 | \
  xargs -0 perl -i -pe 's|https?://(?:www\.)?xxxxx\.gr||g'

# strip local dev host
find . -type f \( -name '*.html' -o -name '*.htm' \) -print0 | \
  xargs -0 perl -i -pe 's|https?://localhost:8080||g'

After this, href="https://www.xxxxx.gr/about" becomes href="/about" — a root-relative URL that works identically on xxxxx.gr, on staging, or opened locally from a static server.

Lessons

  • Run legacy mirrors locally, never on production. You get LAN-speed crawling, zero production load, and the freedom to retry.
  • Named volumes ≠ bind mounts. If you're on macOS and the files contain legacy encodings, you must use a named volume. Binding exposes you to APFS's UTF-8 enforcement.
  • EOL base images need apt archive redirection AND signature bypass. Both, not one or the other.
  • docker compose down -v is dangerous. It targets all volumes in the project, not just the one you were thinking about.
  • Match the DB credentials to the dump, not the other way around. Fewer edits, fewer footguns.
  • Production's .htaccess comes along for the ride. Read it; it may be blocking you.
  • The --reject-regex is the difference between a 30-minute mirror and an infinite crawl. Joomla generates many URL variations of every article (print, PDF, RSS, tag archives). Exclude them explicitly.
  • Verify coverage by diffing the DB against the filesystem. A low-link-count article can be orphaned from navigation; a sitemap plugin (or your own DB query) is the ground truth.
  • "Content exists in DB" ≠ "content is reachable on site." Orphan records are surprisingly common on sites that have lived for a decade.
  • URL portability is a rewrite problem, not a wget problem. --convert-links only rewrites references to hosts wget crawled. Anything hardcoded in article bodies needs a separate perl -i -pe pass.
  • Use perl -i -pe instead of sed -i on macOS for mixed-encoding content. BSD sed gives up on the first illegal byte.
  • Extensionless URLs need a server-side rewrite rule, not a wget option. Keep --adjust-extension (self-describing files) and add the Apache .htaccess rule.

The mirror we ended up with: 7,765 HTML pages, 37,424 files, 3.2 GB, covering every publicly reachable URL on the original site. It can now be hosted on any static-file server, indefinitely, for pennies per month. The Docker stack can be torn down at any time and spun up again on demand by re-running docker compose up -d. Total elapsed time from "let's try this" to "fully archived site": about one afternoon, most of which was spent debugging the issues catalogued above.