Ask HN: How to Resurrect a Site from Archive.org?

82 points · rrr_oh_man · 7 days ago

I recently bought the expired domain of a niche interest site because the previous owner was determined to let it die and didn't want to put any more effort into it.

Is there a way I can "revive" it from archive.org in a more or less automated fashion? Have you ever encountered anything like it? I am familiar with web scraping, but archive.org has its peculiarities.

I really, really love the content on it.

It's a very niche site, but I would love for it to live on.


45 comments
duskwuff · 6 days ago
> I recently bought the expired domain of a niche interest site because the previous owner was determined to let it die and did not want to put any effort in it anymore. Is there a way I can "revive" it from archive.org in a more or less automated fashion?

Buying a domain name does not award you ownership of the content it previously hosted. If you have not come to some agreement with the previous owner, you should not proceed.


ulrischa · 22 hours ago
I’ve seen a lot of people do this when resurrecting old niche sites. The high-level approach usually involves grabbing all the snapshots from archive.org, stripping out their timestamped URLs, and consolidating everything into a local mirror. In practice, you want to:

1. Collect a list of archived URLs (via archive.org's CDX endpoints).

2. Download each page and all related assets.

3. Rewrite all links that currently point to `web.archive.org` so they point to your domain or your local file paths.

The tricky part is the Wayback Machine’s directory structure—every file is wrapped in these time-stamped URLs. You’ll need to remove those prefixes, leaving just the original directory layout. There’s no perfect, purely automated solution, because sometimes assets are missing or broken. Be prepared for some manual cleanup.
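Stripping those time-stamped prefixes can be done in a single regex pass. A sketch (the 14-digit timestamp and the optional two-letter modifier such as `im_` or `js_` follow Wayback's URL scheme):

```python
import re

# Wayback wraps every asset as
#   https://web.archive.org/web/<14-digit timestamp><optional modifier>/<original URL>
# Strip that wrapper so links point back at the original locations.
WAYBACK_PREFIX = re.compile(
    r"https?://web\.archive\.org/web/\d{14}(?:[a-z]{2}_)?/"
)

def unwrap(html: str) -> str:
    return WAYBACK_PREFIX.sub("", html)
```

After this pass you can do a second rewrite mapping the original absolute URLs to your domain or to relative paths.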

Beyond that, the process is basically: gather everything, clean up links, restore the original hierarchy, and then host it on your server. Tools exist that partially automate this (for example, some people have written scripts to do the CDX fetching and rewriting), but if you’re comfortable with web scraping logic, you can handle it with a few careful passes. In the end, you’ll have a mostly faithful static snapshot of the old site running under your revived domain.
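A minimal sketch of the gathering step, using the CDX API mentioned above (the endpoint, JSON output, and the `id_` raw-content modifier are documented Wayback Machine features; `example.com` is a placeholder, and error handling and rate limiting are omitted):

```python
import json
import urllib.parse
import urllib.request

CDX = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain):
    # Ask the CDX API for every captured URL under the domain,
    # collapsing to one capture per unique URL.
    params = {
        "url": f"{domain}/*",
        "output": "json",
        "fl": "original,timestamp",
        "filter": "statuscode:200",
        "collapse": "urlkey",
    }
    return CDX + "?" + urllib.parse.urlencode(params)

def snapshot_url(original, timestamp):
    # The "id_" modifier asks Wayback for the original bytes,
    # without the toolbar or its rewritten links.
    return f"https://web.archive.org/web/{timestamp}id_/{original}"

def list_snapshots(domain):
    # Returns (original_url, raw_snapshot_url) pairs for the domain.
    with urllib.request.urlopen(cdx_query_url(domain)) as resp:
        rows = json.load(resp)
    return [(orig, ts and snapshot_url(orig, ts)) for orig, ts in rows[1:]]
```

Fetching each `raw_snapshot_url` and saving it under the original path gives you the local mirror; the link-rewriting pass then runs over those saved files.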

Gualdrapo · 19 hours ago
I was commissioned to recover ideawave.ca from archive.org after its owner lost the database, so pretty much all that was left of the site was on archive.org. I think it ran on WordPress, but he asked me to port it to Jekyll.

I scraped its contents (blog posts, pages, etcetera) with Python's BeautifulSoup and redid its styling by hand, which was nothing otherworldly (the site was from around 2010 or so), and took the chance to make some improvements.

The thing with the scraping was that the connection kept dropping and it was really slow, so I had to keep a record of the last successfully scraped post/page/whatever and, if something went wrong, restart from that point.
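That resume-from-last-success trick can be sketched like this (the commenter kept the record in memory; persisting it to a file also survives a crash — the checkpoint filename here is arbitrary):

```python
import os

CHECKPOINT = "last_done.txt"

def load_checkpoint():
    # Return the last successfully scraped URL, or None on a fresh run.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return f.read().strip()
    return None

def scrape_all(urls, fetch):
    # Skip everything up to and including the last recorded success,
    # then record each URL only after fetch() finishes without raising.
    last = load_checkpoint()
    if last in urls:
        urls = urls[urls.index(last) + 1:]
    for url in urls:
        fetch(url)  # may raise on a dropped connection
        with open(CHECKPOINT, "w") as f:
            f.write(url)
```

If `fetch` raises partway through, rerunning `scrape_all` with the same list picks up right after the last URL that completed.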

Got pennies for it, mostly because I lowballed myself, but got to learn a thing or two.

