Is archiving the internet a good thing?

Today I was trying to look something up and hit a dead link to a forum. I was able to find a copy of it on the Internet Archive, but as with most older forums, this one had some unsavory comments on it.

This got me thinking: is the Internet Archive a good thing? Should all data be backed up perpetually to a giant archive for future generations to dig through and make sense of?

There are obvious pros and cons to it. I think everyone has at some point experienced the joy of finding the solution to some weird, obscure tech error on a very old forum or blog post. But there’s also the downside that privacy just doesn’t exist on the internet if material is public-facing in any manner, and it’s very hard to scrub your trail from it.

Various countries have laws surrounding the “Right to be Forgotten”, which works on paper and less so in practice. For example, if you deleted your Reddit profile and told Reddit to remove all data about you under your country’s laws, they would probably do that for you. But what about all the other sites and services that are out there scraping it? You would have to figure out who all those sites are, find them, and ask them to remove it. Depending on what that entity is, maybe they do that, or maybe they consider anonymizing your data good enough, despite the very real fact that anonymizing data is often not enough, since an individual can be identified from enough unique data points (see browser fingerprinting).
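
To make the anonymizing point a bit more concrete, here is a rough Python sketch. Everything in it is made up for illustration (the field names, the values, the `unique_fraction` helper), and it is not how any real scraper or fingerprinting library works. It just counts how many "anonymous" rows become one of a kind once you combine a few harmless-looking fields.

```python
from collections import Counter

# Hypothetical "anonymized" records: no names, only a few attributes.
# All field names and values are invented for this example.
records = [
    {"timezone": "UTC-5", "browser": "Firefox", "screen": "1920x1080", "fonts": 212},
    {"timezone": "UTC-5", "browser": "Firefox", "screen": "1920x1080", "fonts": 187},
    {"timezone": "UTC-5", "browser": "Chrome", "screen": "1920x1080", "fonts": 212},
    {"timezone": "UTC+1", "browser": "Chrome", "screen": "2560x1440", "fonts": 140},
    {"timezone": "UTC+1", "browser": "Chrome", "screen": "1920x1080", "fonts": 140},
    {"timezone": "UTC-5", "browser": "Chrome", "screen": "1920x1080", "fonts": 95},
]


def unique_fraction(rows, fields):
    """Return the fraction of rows whose combination of `fields` appears exactly once."""
    def key(row):
        return tuple(row[f] for f in fields)

    combos = Counter(key(row) for row in rows)
    singled_out = sum(1 for row in rows if combos[key(row)] == 1)
    return singled_out / len(rows)


# Add fields one step at a time and watch how many rows become unique.
for fields in (
    ["timezone"],
    ["timezone", "browser"],
    ["timezone", "browser", "screen", "fonts"],
):
    print(fields, unique_fraction(records, fields))
```

With this toy data, no row is unique by timezone alone or by timezone plus browser, but every row is unique once all four fields are combined, which is the basic idea behind fingerprinting.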

This is where the Internet Archive gets a bit tricky. The Internet Archive will happily take down a personal site or blog that has been scraped by them, but you pretty much have to file a DMCA request providing proof of who you are, to make sure it’s not just a random person trying to get someone else’s site delisted. This becomes trickier when things like forums and other social-media-like websites get involved. The Internet Archive is just scraping whatever people tell it to scrape and will do so unless it’s on their list of URLs not to touch. This can lead to instances where, say, I want my information removed from the archive: I would have to find each instance of it myself, ask them to remove it, and hope that they believe the Wazanator making the request is the same one that is in the archived MySpace page from the early 2000s. Then there’s the question of whether I should even be able to do such a thing or if I should have to live with a cringe MySpace page for everyone to find.

Personally I think I fall in favor of it, just because I like digging through old data, and having to recognize our past and its mistakes is something we shouldn’t avoid. The other day I found an archive from 1996 of a CD-ROM that was advertised as having 1078 Weird Textures in the “highly detailed” resolution of 256x256. And while I can definitely see a reason why user data should be purged for personal protection, I also think the good of maintaining things like forum posts in the long term outweighs the bad. I don’t really like the idea of a PR team being able to scrub someone’s history from the internet, for example. Internet history shouldn’t be volatile; we should just be teaching people that they need to be careful about what they post because it’s going to be permanent.

Curious to know what others’ thoughts are on the Internet Archive and the obsession with backing everything up.


Internet archiving is no better or worse than any other historical archiving. Whether it’s good, bad, or indifferent depends on who is doing the archiving and what it’s being used for. At what point for example does the inarguably bigoted media of the past cross some kind of threshold of historical relevance where it’s worth observing as a key to the mindset of the time? At what point does an individual cross a line in the sand that shifts from just being a private person trying to live life to a public figure worthy of discussion and debate?

I imagine the biggest problem with web archiving is the same problem most web stuff has right now: the relative importance of everything is being penned mostly by white male tech-libertarians and what they view as useful. Something Awful will be preserved in amber forever, but the beginnings or significant chunks of black internet culture? Not a chance.


Agreed with that point. ^

The fact that corporations, governments, and wealthy enough individuals can purge certain content from the internet - be it as innocent as a Nintendo fangame or as damning as classified intelligence - while ordinary people can rarely do more than delete their account from public view or anonymize it is pretty shit.

And as CrimsonBehelit points out, I don’t really trust a web traffic analyst’s non-profit or a tech startup to make the calls about what should or should not be archived. Especially with there being a lot out there that people did not consent to having on the internet (data theft, doxxing, kids who don’t know better yet posting too much personal information, etc.).

There’s probably a ‘no ethical internet archival under capitalism’ joke you could make here, IDK. I am glad internet archival is happening, flawed as it is, because we’ve lost so much history and we’d lose so much more without it. But I think a lot about how it is done would have to change for it to be good.


I am super in favour of the Internet Archive from the historical archiving perspective, but that’s partly because I don’t trust individual owners of important sites to archive things themselves.
Because a lot of our “cultural context” is solely on websites nowadays, losing some websites can be like a small library, or a cache of letters, burning down in the past. But with websites, it can just be ‘oh, we stopped paying for the hosting’… and suddenly all is gone.
Sometimes it’s more deliberate, but still culturally harmful - consider, for example, when Machinima were bought and their new owners told them to delete their entire YouTube channel!
(I wrote a talk about the importance of historical archiving of web content for one of the European Roller Derby Community’s annual conferences, when we had those, so…)


You’re not wrong, which is why I started backing up some smaller, older Half-Life 2 websites on my own: I noticed a lot were starting to drop off, and I found HTTrack really easy to use. I even went ahead and backed up Valve’s Developer Community wiki, because that thing always seems to be 2 steps removed from death, and it only came out to about 4.5GB. I think there’s a way to mirror back to the Internet Archive? I know someone did it for the entirety of the Garry’s Mod workshop earlier this year and it was something like 1.7TB. So there is a way, just not sure if that way is easy or not.

At some point I really want to try and archive as many BSP files as possible just because I don’t think a lot of people are thinking about CSS/TF2 mapping history.

This immediately reminded me of what happened with the Dota AllStars forum, which at this point I think I would call the destruction of a large and historically important gaming community for the sake of profit.