Recording my maddening process of iterating on a personal project.
YouTube Archiver Idea
I love watching YouTube, and I'm a bit of a data hoarder. For a while, I've wanted a way to archive every video I've ever watched or put in my Favorites playlist. Of course, having an offline backup of a YouTube video is only useful if the video gets taken down.
In a grander sense, it would be nice to have access to a backup copy of any YouTube video that gets taken down. It's not feasible to download every YouTube video in the world; somewhere around 3,700,000 new videos are uploaded every day. The idea behind this project is to massively crawl content aggregators for YouTube links, fetching the metadata for these videos and putting it in a database. On a recurring basis, it could check to see which videos have been removed. With a large enough dataset, you might be able to determine which videos are at the highest risk of being removed, and prioritize downloading them first.
TLDR: What I'm proposing is collecting a dataset of YT videos, classifying which videos are more likely to be taken down, and prioritizing those vids to be archived.
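At its core, the crawling step boils down to pulling YouTube video IDs out of whatever text the aggregators give you. Here's a minimal sketch of that extraction; the regex and the `extract_video_ids` helper are my own illustration, not part of any library mentioned above:

```python
import re

# Matches the 11-character video ID in the common YouTube URL forms:
# youtube.com/watch?v=..., youtu.be/..., and youtube.com/embed/...
YT_ID_RE = re.compile(
    r"(?:youtube\.com/(?:watch\?(?:[^\s]*&)?v=|embed/)|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})"
)

def extract_video_ids(text):
    """Return the unique YouTube video IDs found in a blob of text,
    in the order they first appear."""
    seen = []
    for match in YT_ID_RE.finditer(text):
        video_id = match.group(1)
        if video_id not in seen:
            seen.append(video_id)
    return seen
```

Those IDs are what would get stored in the database and periodically re-checked for availability.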
Things I Cover
PRAW (very briefly)
Web scraping with lxml
yt-dlp
MongoDB