I was looking at The Intercept and trying to figure out how to get the full list of articles they've published from the beginning. This turns out not to be very straightforward. They have some infinite-scrolling script abomination that you can, in theory, just keep scrolling down and down and down, and five hours from now you'd have the full list. But in the meantime the browser would probably crash or something, and even if it didn't, how would I put the list into a useful format? I'd just want to see the list of static URLs (and they do exist), produced by a simple script on some free site that would crawl back through the links using whatever JSON interface (I assume) the index lookups are served from.
They have one main script in the body of the site as loaded, which (after a pass through jsbeautifier.org) contains various mentions of "fetch" that seem to pertain to getting new posts by "slug", but I don't really understand what it's doing.
Also, there are a lot of really weird numbers and single-letter functions in this. I don't really know what I'm doing with this, but let's be clear: are they trying to make it hard to figure out their index system, or am I just clueless about how it works?
Also, at the other end: if you can manage to successfully cut and paste a huge block of infinitely scrolled web dreck out of your browser, where can you PUT it to look at the content effectively? You could dump it in Notepad and get the text without any link information, or you could dump it in an Office clone and get something so weighed down with all the pictures and other HTML content that it would surely crash, or damn near crash. Is there some program you can dump it into that files and indexes the information in a way that you can go through conveniently? Or are there add-ons that extract every link on a web page and dump them into a text file (one such pipeline is sketched below)? Etc.
Wnt (talk) 08:26, 31 July 2016 (UTC)
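For the link-extraction part, a minimal sketch of one way to dump every link from a saved page into a text file with standard shell tools; the file names here are placeholders:

```sh
# Pull every href attribute out of a locally saved page and write one URL per line.
# "saved_page.html" and "links.txt" are placeholder names.
grep -oE 'href="[^"]+"' saved_page.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | sort -u > links.txt
```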
Use a curl-like utility in a shell script that runs in a loop, incrementing the last parameter, and redirect the output to a file; then grep for the URLs you are looking for. This is the API that Ruslik was talking about. 120.63.227.88 (talk) 12:20, 1 August 2016 (UTC)
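A minimal sketch of that loop in plain shell. The endpoint URL and the page parameter below are placeholders, not The Intercept's actual index API; the real request URL would have to be read out of the site's own fetch calls:

```sh
#!/bin/sh
# Fetch successive index pages, append each response to a dump file,
# then grep the article URLs out of the dump.
# The endpoint and the "page" parameter are placeholders; substitute
# whatever URL the site's own script actually fetches.
base='https://example.com/articles.json'
page=1
while [ "$page" -le 50 ]; do            # 50 pages is an arbitrary cut-off for this sketch
    curl -s "${base}?page=${page}" >> dump.json
    page=$((page + 1))
    sleep 1                             # be polite between requests
done
grep -oE 'https://[^"]+' dump.json | sort -u > article_urls.txt
```

Because each page is appended to a single dump file, the crawl can be stopped and resumed by adjusting the starting page number rather than scrolling a browser for hours.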