Yarn

Recent twts in reply to #momapxa

Anyone know of a tool that will crawl a website, run JavaScript, and then save the resulting DOM as HTML?

I tried Wpull, but I can’t get it to stop crashing on startup and development seems to have stopped.

I’m sure there’s a joke to be made about Python here.
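For anyone else looking: a minimal sketch of the idea using Playwright's Python API (an assumption on my part; any headless-browser driver would do the same job). The `output_path` helper and all names here are illustrative, not from any existing tool.

```python
# Sketch: fetch one page with a headless browser and save the rendered DOM.
# Assumes Playwright (pip install playwright && playwright install chromium).
from urllib.parse import urlparse


def output_path(url: str) -> str:
    """Map a URL to a local file path, appending index.html for directories."""
    path = urlparse(url).path
    if path.endswith("/") or not path:
        path += "index.html"
    return path.lstrip("/")


def save_rendered_dom(url: str, outfile: str) -> None:
    # Imported here so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let scripts and XHRs settle
        with open(outfile, "w", encoding="utf-8") as f:
            f.write(page.content())  # serialized post-JavaScript DOM
        browser.close()
```

A full crawler would loop this over every link it discovers, but the core "run JS, then dump the DOM" step is just `page.content()`.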

@prologic@twtxt.net I’m trying to make a static local mirror of MDN Web Docs. It’s all free information on GitHub, but the whole system is extremely complicated.

<​tinfoil-hat>I think it’s so they can sell more MDN plus subscriptions, making people use their terrible MDN Offline system that uses the local storage of your browser.<​/tinfoil-hat>

At this point, I’m willing to run a local dev server and just save each generated page and its dependencies.

I really only need it to run JavaScript so it can request the browser compatibility JSON. It’s https://github.com/mdn/browser-compat-data but the MDN server, annoyingly, transforms it.

Once the BCD is rendered statically, I should be able to remove the references to the JavaScript.

That will solve another issue I’m having where the JavaScript is constantly trying to download /api/v1/whoami, which seemingly has no purpose aside from user tracking.
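If the transformed server response turns out to be unusable, one fallback is to render the raw BCD JSON yourself. A rough sketch, assuming the `__compat` / `support` / `version_added` shape documented in the mdn/browser-compat-data repo (the walker itself is mine, not part of any MDN tooling):

```python
# Sketch: walk a browser-compat-data (BCD) JSON tree and pull out a
# {browser: version_added} table per feature, ready to render as static HTML.


def walk_bcd(node, path=()):
    """Yield (feature_path, {browser: version_added}) pairs."""
    if not isinstance(node, dict):
        return

    def version(entry):
        if isinstance(entry, list):  # some browsers list multiple ranges
            entry = entry[0]
        return entry.get("version_added") if isinstance(entry, dict) else None

    compat = node.get("__compat")
    if compat is not None:
        support = {b: version(e) for b, e in compat.get("support", {}).items()}
        yield ".".join(path), support

    for key, child in node.items():
        if key != "__compat":
            yield from walk_bcd(child, path + (key,))
```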

@prologic@twtxt.net That’s awfully nice of you, but you don’t need to do that. I know you’re a busy guy.

I’m sure I can find something if I look around some more. I can’t be the only one that wants to make a static mirror of a dynamic website.

@prologic@twtxt.net What I need it to do is crawl a website, executing JavaScript along the way, and save the resulting DOMs to HTML files. It isn’t necessary to save the files downloaded via XHR and the like, but I would need it to save page requisites: CSS, JavaScript, favicons, etc.

Something that I’d like to have, but isn’t required, is mirroring of content (+ page requisites) in frames. (Example) This would involve spanning hosts, but I only need to span hosts for this specific purpose.

It would also be nice if the program could rewrite absolute paths as relative paths (/en-US/docs/Web/HTML/Global_attributes -> ../../Global_attributes), but this isn’t required either. I think I’m going to need a local Web server running anyway, because just about all the links point to directories containing an index.html. (i.e., the actual file referenced by /en-US/docs/Web/HTML/Global_attributes is /en-US/docs/Web/HTML/Global_attributes/index.html.)
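The absolute-to-relative rewrite is actually a one-liner with the stdlib, since URL paths follow POSIX rules. A sketch (the referencing page path below is a made-up example):

```python
# Sketch: rewrite an absolute site path relative to the page that links to it.
# posixpath is used instead of os.path so this works on URL-style paths
# regardless of the local OS.
import posixpath


def to_relative(target: str, referencing_page: str) -> str:
    """Return target rewritten relative to the directory of referencing_page."""
    start = posixpath.dirname(referencing_page)
    return posixpath.relpath(target, start=start)
```

For a hypothetical page at /en-US/docs/Web/HTML/Element/a/index.html, this turns /en-US/docs/Web/HTML/Global_attributes into ../../Global_attributes, matching the example above.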

Now I’ve just realized that if /en-US/docs/Web/HTML/Global_attributes is saved with that filename, the Web server is probably going to send the wrong MIME type. Wget solves this with --adjust-extension.
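The behaviour Wget's --adjust-extension implements is simple enough to replicate in a custom crawler. A sketch of the idea (my own names, not Wget internals):

```python
# Sketch of what Wget's --adjust-extension does: give saved HTML files a
# .html suffix, keyed off the Content-Type the server reported, so a static
# file server later picks text/html from the extension.


def adjust_extension(path: str, content_type: str) -> str:
    """Append .html to paths served as HTML that lack the extension."""
    if content_type.startswith("text/html") and not path.endswith(".html"):
        return path + ".html"
    return path
```

Links inside the saved pages then need the same rewrite applied, which is what Wget's companion option --convert-links handles.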

Man, you really don’t have to do this…

If I can get a proper static copy of MDN, I’ll make a torrent and share a magnet link here. I know I’m not the only one who wants something like this. I don’t think the file sizes will be so bad. My current “build” of the entire site is sitting at 1.36 GiB. (Only a little more than double the size of node_modules!) So, with browser compatibility data and such, I think it’ll still be less than 2 GiB.

Aggressively compressed with bzip2 -9, it’s only 114.29 MiB. A compression ratio of 0.08. That blows my mind.
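The arithmetic checks out:

```python
# Quick check of the figures above: 1.36 GiB uncompressed vs 114.29 MiB
# after bzip2 -9.
uncompressed_mib = 1.36 * 1024  # 1.36 GiB expressed in MiB
compressed_mib = 114.29
ratio = compressed_mib / uncompressed_mib
print(round(ratio, 2))  # about 0.08, i.e. ~12:1
```

An extremely low ratio like that is typical for large trees of repetitive HTML markup.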

@mckinley@twtxt.net I can confirm the library “does what it says on the tin” 👌 I’ll put my little CLI tool up for you to play with. It’s pretty damn stupid and basic right now, as I’m not yet really sure how to flesh this out. Will need you to guide me on this; there are probably a fair few nuances to writing a decent web mirroring tool. (At least it does the right thing and handles dynamic content rendered with JavaScript – which I tested by hitting my files.mills.io web app, which has a pure JS frontend using MithrilJS.)

Download

@mckinley@twtxt.net I’ve made a few more commits to mirror – but sadly it’s not currently as good as I’d hoped. Turns out mirroring the structure of websites is rather tricky? Maybe you have some tips to help? 😅 Anyway, give it a whirl, very much pre-alpha.
