Wikipedia:Request a query/Archive 5 Source: en.wikipedia.org/wiki/Wikipedia:Request_a_query/Archive_5
This is an archive of past discussions on Wikipedia:Request a query. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current main page.
I realize PetScan can be used to show links that appear on two pages, but I need it to show such a result sorted by the order in which the links appear on the first page. Based on my limited understanding of the database, I don't think the order of the links on a page is tracked. But just in case I'm wrong, can someone show me a query to do this? Or suggest an alternative approach? Thanks. Stefen Towers among the rest! Gab • Gruntwerk 23:49, 22 June 2024 (UTC)
No, that's not possible; Quarry doesn't contain information about the content of the page.
Could you give some additional information about what you want this for? There are several alternative approaches that would work, but I need to know a little more about your use case. BilledMammal (talk) 23:53, 22 June 2024 (UTC)
It can see which pages a given page links to (or is linked from), but doesn't have any information on what order it's done in - that would need the page content. Is this a one-time query, or will you need it repeated? —Cryptic 00:08, 23 June 2024 (UTC)
I see you want something that can run repeatedly. I don't have time right now to put something together for you, but if Cryptic doesn't come up with something I'll do it sometime in the next couple of weeks - if I don't, feel free to remind me on my talk page. BilledMammal (talk) 00:24, 23 June 2024 (UTC)
Thanks for the list - that's good for a start. Would you mind giving me a few clues on your approach for the script you did? It might snap me into figuring it out. Also, I have thought of using a spreadsheet or text-compare software, but was hoping for an on-wiki or otherwise online approach. Stefen Towers among the rest! Gab • Gruntwerk 00:40, 23 June 2024 (UTC)
A few regex operations to get just the links in order, and then a basic Python script that walks down the first list and outputs each item that also exists on the second. Unfortunately, nothing online ATM. BilledMammal (talk) 00:49, 23 June 2024 (UTC)
Any pure-sql-against-the-wmf-databases approach would have to start with something similar to either "Make a page in your userspace with redlinks to 1!Tom Cruise, 2!Muhammad Ali, ... 1000!Frank Torre" or "manually edit this stupidly long query that includes all that data" (like how quarry:query/81948 includes the namespace names, but with a thousand items instead of 30). —Cryptic 01:04, 23 June 2024 (UTC)
Indeed, those don't seem like tenable approaches. But this discussion has helped me figure out a solution, not optimal but workable:
Copy the popular articles' wikitext into a text editor, and break it down into a flat list using a macro I constructed with regex and other tricks.
Use PetScan to create a flat list of articles with old issues (this didn't have to be in any particular order).
Insert both flat lists into their own columns in a spreadsheet, then find matches of first-column entries in the second column, then apply a filter on matches, and voilà.
In 2013, I heard that about a quarter of newcomers' first edits were to create a page, and three-quarters were to edit an existing page. (My own first edit was to a Talk: page, which was a distinctly unpopular choice.)
For editors who made their first edit during 2023, is it possible to find out how many of those first edits were page creations vs editing existing pages, and which namespaces they happened in? I imagine a set of numbers like this:
Lots of ways for this to go wrong. Consider deleted edits, and that there's no way to reliably tell where a page was when an edit was made to it (unless it's also in the page creation log, but that still means at least half the data is bad). Also no way to tell which of two deleted edits with the same timestamp was actually earlier, but a user making a second edit in the same second as their first is going to be rare enough that we can just pick one. Getting a list of users whose first edit was in 2023 is impractical; a list of users who were created in 2023 and have made at least one edit is at least close, so that's what I'm doing. quarry:query/84486 should finish in about 15 minutes. It may or may not be right - I haven't seen the results yet - but I've got to get to bed. Will take another look tomorrow. —Cryptic 07:47, 29 June 2024 (UTC)
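For illustration, a much-simplified sketch of the live-edits-only half of that approach (ignoring the deleted-edit and same-timestamp handling described above; the correlated subquery makes it slow, so treat it as a shape rather than the production query):
SELECT page_namespace,                               -- where the page is now, per the caveat above
       SUM(rev_parent_id = 0) AS created_page,       -- rev_parent_id = 0 marks a page creation
       SUM(rev_parent_id <> 0) AS edited_existing_page
FROM user
JOIN actor ON actor_user = user_id
JOIN revision ON rev_actor = actor_id
JOIN page ON page_id = rev_page
WHERE user_registration LIKE '2023%'                 -- accounts created in 2023
  AND rev_timestamp = (SELECT MIN(r2.rev_timestamp)  -- that user's first live edit
                       FROM revision r2
                       WHERE r2.rev_actor = actor_id)
GROUP BY page_namespace;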
Hm, the corrected query finished a lot faster than the obviously-broken ones I ran before (which is usually to be expected, but I didn't think the data would cache that well). Something's not quite right - there shouldn't be that row with the completely blank namespace and 52458 existing-page edits, for starters, and the one creating a page in the MediaWiki namespace looks really suspicious - but otherwise the numbers seem at least plausible. —Cryptic 07:59, 29 June 2024 (UTC)
If we ran that for, say, last week, would that give us an estimate of how many first edits are being 'lost' or 'misplaced' due to deletion and moving pages? WhatamIdoing (talk) 03:34, 30 June 2024 (UTC)
Well, I can run the numbers (quarry:query/84512), but I don't know how much to infer from the comparison with the longer time period. The only edits that are going to be 'lost' are ones that have been revdeleted or oversighted, and in most cases - other than, perhaps, for copyvio revdels - that will happen pretty quickly after they're made. I wouldn't care to guess how quickly a typical page gets moved between namespaces after creation, either. —Cryptic 08:17, 30 June 2024 (UTC)
Does it only lose revdel edits, and not ordinary/whole page deletion?
Just revdelled and suppressed edits. Everything about deleted pages and edits except for edit summaries and page text is in the public database replicas; accounting for those is most of why the query is as complex as it is. —Cryptic 10:52, 2 July 2024 (UTC)
So here are a few things I notice:
Last week, 4,754 newbies made their first edit on an existing page and 1,842 newbies made their first edit to a new page.
Last week, if your first edit was to an existing page, then about 90% of the time, it was to the [i.e., a page most recently in] mainspace.
Last week, if your first edit was to create a new page, then about 70% of the time, it was to the User: namespace and about 20% of the time, it was to the Draft: space.
Within the first week, about a third of User: space pages and a third of Draft: space pages get deleted.
In 2023, 342,087 newbies made their first edit to an existing page and 144,181 newbies made their first edit to a new page.
In 2023, if your first edit was to an existing page, then about 90% of the time, it was to the mainspace.
In 2023, if your first edit was to create a new page, then about 70% of the time, it was to the User: namespace and about 20% of the time, it was to the Draft: space.
Within the last 18 months, looking at pages created on the first edit during 2023, about a third of User: space pages and almost 95% of Draft: space pages were deleted.
Conclusion from last week vs 2023: If your User: space page is going to get deleted, it'll happen in the first week. Draft pages are much more likely to get deleted, but it takes longer.
Within the last 18 months, only 1.6% of pages created as the account's first-ever edit were moved to the mainspace. 20% of those were also subsequently deleted. One in 75 first-edit page creations is still visible in the mainspace.
Actual User: pages probably aren't meant to get moved to the mainspace, and the existence of these pages depresses the overall 'success' numbers. An unknown proportion of User sandbox pages probably are meant to move to the mainspace, while others are probably meant to be used for test edits. We don't have a good way to differentiate between these two types of user-space contribution.
Within the last 18 months, looking at pages created on the first edit during 2023, about a third of User: space pages and almost 95% of Draft: space pages were deleted. - this is a shocking statistic but if I understand correctly it can't be taken at face value because the namespace is where the page is now, right? That is, 95% of drafts that were never moved to mainspace are deleted, which is explicable in terms of G13 (I assume the 5% is drafts less than six months old?) – Joe (talk) 20:35, 2 July 2024 (UTC)
@Joe Roe (mobile), all of the drafts created in 2023 are more than six months old at this point. There were probably one or two drafts that were a few hours short of being six months old when the query was run, so it's like 0.05%. The 5% that have been retained were probably created later in the year and had an edit made (by anyone) since then. For example, the 5% probably includes articles that were submitted to AFC in December, declined in January, and will be deleted – just not quite yet.
The namespace listed is the namespace in which either the page exists now (more specifically, at 30 Jun 2024 08:21:10 UTC) or where the page existed at the time that it was deleted (e.g., ordinary CSD or AFD, not revdel or OS/suppression – those latter ones are invisible to the query). WhatamIdoing (talk) 21:14, 2 July 2024 (UTC)
This can't be done with a sql query - interwiki links aren't in the main database, and jawiki is on a different database server than enwiki. Someone more familiar with WP:PetScan than I might be able to get a result with it. Also worth asking at d:WD:RAQ; they do have access to the interwikis over there, and I seem to recall there's a way to get at wikiproject ratings as well. —Cryptic 17:13, 16 July 2024 (UTC)
Erm. Wow. That's egg on my face. I'm sure I've looked for them before, and had thought that table went away when language links were migrated to Wikidata. Hang on. —Cryptic 17:25, 16 July 2024 (UTC)
I'm pretty sure not, unless it was depopulated by a bot, or was old enough that you can find the cat page with the Wayback Machine. —Cryptic 19:27, 17 July 2024 (UTC)
Composition categories without templates
Hi all, if possible, I'd like to request a list of any subcategories in Category:Compositions by composer (which have more than 2 entries) in which there is no corresponding navigational box created for said composer's works.
So essentially, I'd be looking for categories like Category:Compositions by Example Person which have more than 2 pages but no corresponding {{Example Person}} template. I'd then go through the list and create the missing templates myself.
Thank you so much @Cryptic, that should work well. Question: will this link stick around or should I copy the results somewhere for my own use? Aza24 (talk) 00:24, 29 July 2024 (UTC)
It'll stay in place (and the results won't change even when the underlying data here does). It would probably be convenient to copy it into a sandbox page, though, so you can click on the links directly. The cyan "Download data" dropdown has an option for wikitable format. —Cryptic 00:29, 29 July 2024 (UTC)
Not possible with a query: the table showing transclusions doesn't have duplicates (it's the same one used to generate Special:Whatlinkshere), and the replicas don't have the page text. Best I can think of is using search one template at a time, like so. —Cryptic 18:58, 27 July 2024 (UTC)
I'm interested in finding the articles that make heavy use of Template:IETF RFC. Would it be possible to rank the top-10 articles with the most calls ("transclusions"?) of that template, please? Thanks! fgnievinski (talk) 01:20, 15 August 2024 (UTC)
It might be feasible to do it manually, though - there's only 817 articles that transclude it. You could scrape the wikitext for all of them, parse them for transclusions, and count those. There's only a couple redirects to confuse things, though if there's any indirect transclusions (i.e., through a different template), that'd make things harder. —Cryptic 01:35, 15 August 2024 (UTC)
Both of those projects are almost certainly pulling their data from the same place, hence with the same limitations. —Cryptic 03:01, 15 August 2024 (UTC)
As above, search can sort of do this. This looks for "{{", zero or more spaces, "ietf", any single character, then "rfc", case insensitive, with all of that occurring at least forty times in the same mainspace page (both redirects to {{IETF RFC}} with any transclusions at all match that pattern). It finds 13 articles. It could very well be missing some, though; for example, {{IETF RFC|1234}} with extra spaces in the middle would render normally but not be searchable by this. —Cryptic 03:34, 15 August 2024 (UTC)
Hello lovely volunteers. I'm trying to calculate 1) The number of articles deleted in 2022, and 2) The percentage of article deletions in 2022 that were the result of an AfD discussion. I think existing queries do this but would appreciate it if someone could check my interpretation of the queries.
Here are my assumptions:
My definition of an "article" is a mainspace page that is not a redirect. I don't mind including dab pages, pages with no links, etc.
I assume that all articles deleted as a result of an AfD discussion have "Articles for deletion" in the log reason
From quarry:query/78694 I see that 109,583 mainspace pages were deleted in 2022
From quarry:query/78460 I see that in 2022, 37,297 pages were deleted with "redirect overwrite" in the log reason
From quarry:query/78460 I see that in 2022, 5380 pages were deleted with "Redirects for discussion" in the log reason
From quarry:query/78460 I see that in 2022, 13,635 pages were deleted with "Articles for Deletion" in the log reason
I assume that the number of mainspace redirects not accounted for in the above queries is negligible or too difficult to get (is this true?)
Therefore it appears that:
In 2022, 66,906 articles were deleted (109,583 - 37,297 - 5380), an average of 183 per day
In 2022, 13,635 articles were deleted via AfD, an average of 37 per day
In 2022, 20% of article deletions were the result of an AfD discussion.
#3 and #4 are accurate; #5-7 aren't. And #3 isn't either if you'd count a move out of mainspace without leaving a redirect as deletion. Query 78460 doesn't do quite what you think it does. It's not showing the total number of deletions whose logs mention A7, the total number that mention AFD, etc; it tries to assign each deletion to a single reason in a given order. So if, say, a page was deleted with comment "Wikipedia:Articles for deletion/Spacely Sprockets; also a WP:CSD#G11" it would be counted as G11 and not AFD. And there's no way to find out if a page was a redirect or not when it was deleted, other than if the comment mentions an R-series criterion (which that query doesn't look for) or RFD, or if it's the automatic deletion during a page-move. You could conceivably look at the length of the most recent deleted revision, but there's lots of redirects with more bytes in them than lots of short articles. I'd have no confidence in any query of the public replicas that purported to accurately count the number of redirects deleted. But if you were trying for a SWAG, you can still do a lot better than this query - sum the automatic overwrites, plus the R-series speedies, plus the RFDs, plus some proportion of appropriate-looking G6s and G8s (G14s using its redirect clause aren't ever distinguishable from other G14s in my experience). —Cryptic 20:09, 26 August 2024 (UTC)
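For illustration, the assign-each-deletion-to-exactly-one-reason pattern described above looks roughly like this sketch (not the actual query 78460; it assumes the replicas' comment_logging view, and the WHEN order is what makes the Spacely Sprockets example count as G11 rather than AFD):
SELECT CASE
         WHEN comment_text LIKE '%G11%' THEN 'G11'
         WHEN comment_text LIKE '%Articles for deletion%' THEN 'AFD'
         ELSE 'other'
       END AS reason,
       COUNT(*) AS deletions
FROM logging
JOIN comment_logging ON comment_id = log_comment_id
WHERE log_type = 'delete'
  AND log_action = 'delete'
  AND log_namespace = 0
  AND log_timestamp LIKE '2022%'
GROUP BY reason;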
Excellent points, thanks. Re #6, I also just realized that when an article is deleted at AfD, its Talk page(s) and redirects to it are also deleted with "Articles for Deletion" in the log reason. The best way to count articles deleted via AfD is User:JPxG/Oracle as far as I can tell. According to that page, in 2022, the daily average number of AfD deletion discussions that resulted in “Delete” or “Speedy Delete” was 33. I'll work on refining this. Cheers, Clayoquot (talk | contribs) 23:50, 27 August 2024 (UTC)
56, give or take. That's not quite what you're asking; it's the number of current admins (including adminbots) with more than 1000 edits in the last 30 days. I didn't crossref the "active admins" page since they should be mostly the same, give or take recent desysops (Pppery would've just missed the list with 956 edits) or re-activations. Going back more than 30 days is much slower, roughly five or six minutes per month (including the first). —Cryptic 23:07, 26 August 2024 (UTC)
By the way, 1000 edits a month is a lot. I used to hit this back when I was backlog crushing in 2021. Xtools. Nowadays I only watchlist, then spend the rest of my time on coding which doesn't really increase my edit count much, and I no longer hit 1000 edits a month, even though I edit every day. So be careful, such a high threshold may exclude some active admins such as myself who spend hours on wiki every day. –Novem Linguae (talk) 10:17, 27 August 2024 (UTC)
I agree. I'm rarely in the 1,000+ edits a month category myself. [3] If we have good admins that don't have crazy high edit counts once they get the bit, it stands to reason that we're missing out on some good potential admins due to editcountitis. The whole point of this query was inspired by my essay and my comments about it in the above thread. Clovermoss🍀 (talk) 10:25, 27 August 2024 (UTC)
Groovy. If it helps, I think the de facto edit counts to pass RFA nowadays are 8000 total (due to 0xDEADBEEF passing with this recently, and no one passing with anything lower since GoldenRing in 2017) and a couple hundred edits a month for the last 6 months or so (enough to show that you aren't inactive). –Novem Linguae (talk) 17:05, 27 August 2024 (UTC)
@Clovermoss: If you have a candidate for adminship in mind I'd be happy to show them how to fix typos on a large scale. There are many tasks (that may be a bit behind the scenes and boring) that require making many edits. Polygnotus (talk) 17:24, 27 August 2024 (UTC)
Whereas I've averaged 1000+ edits per month almost every year since getting the bit, and I don't appear on Cryptic's list of 56. Been doing a lot of gardening this summer, I guess. :D Valereee (talk) 11:41, 28 August 2024 (UTC)
Valereee, your count from a similar query without the 1000-edit cutoff is 584. That's roughly correct looking at your contributions; it's a simple count of edits in the past 30 days. It won't have counted the two deleted edits you have from that period. —Cryptic 12:30, 28 August 2024 (UTC)
Oh, yes, I typically do have fewer edits this time of year. I was more thinking of averages -- that is, if we're looking for potential candidates who average (say) 1000 edits per month, we wouldn't capture even someone who averages more than 1000 but not in the past month, if you see what I mean. I tend to do more editing when it's dark 13 hours a day, so in January I'd almost definitely be captured. In a typical August, possibly not. Valereee (talk) 12:47, 28 August 2024 (UTC)
Well, by happy coincidence, I just ran this query. It shows admins (again, only users who currently have the bit) who have at least 1000 live edits from at least one of the past four calendar months, i.e. April-July. 100 qualified in at least one month; 60 of them, including at least three bots, averaged over 1000 for the entire period. —Cryptic 12:50, 28 August 2024 (UTC)
Oh, wow, that is fascinating, thank you! 3 out of 5 who made 1000 edits at least once in the past four months also averaged more than 1000 over that period! People who are in are in, I guess?
Could we run a similar query for non-admins with, say, at least a year's experience, at least 10K edits, and 1000 edits at least once over the past four months, or would that be a headache/turn out a way too huge number? Valereee (talk) 13:04, 28 August 2024 (UTC)
It'll come up at quarry:query/85881 eventually. That's going to be a lot slower, since it has to look at all edits in that time period instead of just those made by the 854 current admins. —Cryptic 13:13, 28 August 2024 (UTC)
section headers starting with a lowercase "w" followed by an uppercase letter or a number
I discovered something weird, see Wikipedia:Help_desk#Many_W's. Is it possible to run a query or write a regex search or whatever that meets these criteria:
Finds section headers (no matter how many levels deep) that start with a lowercase "w" immediately followed by an uppercase letter or a number. Polygnotus (talk) 15:05, 28 August 2024 (UTC)
The database replicas don't have article text, so the only place these would show up is if there are redirects to those sections. Regex searches aren't indexed, so don't work well on their own; you have to pair them with something that is, like a category or template or outgoing link or normal search. But if you had that, I'd think insource:/== *w[A-Z0-9]/ would find just about all of these; it'd miss weird space characters, which should be uncommon anyway, and level-1 headers, where you'd get too many false positives with template parameters. Example search, paired with "meters". —Cryptic 15:23, 28 August 2024 (UTC)
Thank you! I think I have them all. I am not worried about level 1 headers because I think that every header that follows this pattern is a subheader of something like "Women's events" or "Women's results" or similar. Polygnotus (talk) 15:34, 28 August 2024 (UTC)
All categories with "LGBT" in their title
Would it be possible to get a list of all categories with LGBT (without a Q) in their title? They all need to be WP:C2D-renamed to change from LGBT to LGBTQ following Talk:LGBTQ#Requested move 14 August 2024, and a query which can be redone at will to get the currently-outstanding list would be helpful. Thanks, HouseBlaster (talk • he/they) 03:08, 31 August 2024 (UTC)
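One plausible, rerunnable shape for that list (a sketch; the simple exclusion would also skip any title that happens to contain both "LGBTQ" and a bare "LGBT"):
SELECT page_title
FROM page
WHERE page_namespace = 14              -- Category:
  AND page_title LIKE '%LGBT%'
  AND page_title NOT LIKE '%LGBTQ%';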
I would like a sentence in Wikipedia:Statistics#Page views that says something like "As of 2024, the median page views for an article is 6 per day, with n% of articles falling into the range of 3 to 50 page views per day" (where 50% ≤ n ≤ 90%, to show the middle range). I think a sample from 10,000 articles (e.g., User:BilledMammal/Average articles) or even just 1,000 would be sufficient.
View counts aren't exposed in the public database replicas AFAIK, so this isn't possible other than by hitting the API once for each page. —Cryptic 19:24, 10 September 2024 (UTC)
After excluding ~200 pages with 0 reported page views from 2023 (because they might have been created during 2024), I find: mean of 5,725 page views per year, median of 55 page views per year, mode of 1. 25% of articles get 3 page views per year(!) or less. 75% of articles get 570 page views per year — or less. Almost exactly 10% of articles average 10+ page views per day. WhatamIdoing (talk) 21:05, 10 September 2024 (UTC)
Cross Reference Request
Would someone be able to cross-reference the links in this list Green Bay Packers All-Time Roster to verify that every target page in that list is also categorized in that category. The goal for me here is to make sure the links in the list are targeted to the right article. As an example, the list had Rex Smith, when it should have had Rex Smith (American football) as the correct target. Thus, this cross-reference will identify any disambiguation that needs to occur on that page for links that aren't disambiguation pages (which I have already fixed). Thank you! « Gonzo fan2007 (talk) @ 22:14, 18 September 2024 (UTC)
Just currently-submitted ones, or ones that have been accepted, too? If the first, you can see these by going to Special:Contributions/79lives, expanding "Search for contributions", picking "Draft" from the "Namespace" dropdown, and clicking the blue "Search" button. That gives you this. If the second, Special:Log/create/79lives comes close. —Cryptic 13:09, 4 October 2024 (UTC)
how many transclusions of a list of templates
Yeah, I can repeatedly fill in the blank but I'd really rather not have to do that. But, I don't know anything about database queries; hence my request.
Given a list of templates, is it possible to fetch each template's transclusion count and return both the template name and the count for each one? The templates that I have in mind are a filtered subset of the templates listed at Category:Lang-x templates.
The "every template in that category" query is just fine; I should have said that the category is lightly filtered, so those that I don't care about can be removed by hand.
Not via Quarry, but I could do it via wikitech:PAWS. Excluding Meta, MediaWiki.org, and testwiki which aren't content projects so have lots of false positives, this gives the following results:
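For a single wiki, the per-template count is straightforward on the replicas; a sketch assuming the current templatelinks/linktarget schema, with placeholder template names standing in for the filtered subset:
SELECT lt_title,
       COUNT(*) AS transclusions
FROM templatelinks
JOIN linktarget ON lt_id = tl_target_id
WHERE lt_namespace = 10                          -- Template:
  AND lt_title IN ('Lang-fr', 'Lang-de')         -- hypothetical subset of Lang-x templates
GROUP BY lt_title
ORDER BY transclusions DESC;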
how long the account was active for (e.g., if the person edited from 2010 to 2020, then say "10 years").
I want to write sentences that say things like "10% of our most-active editors are no longer editing" and "Among these former editors, most of them edited for about 10 years before quitting or being blocked". Ultimately, I'm hoping to use it as a way to ballpark a replacement rate for high-volume editors. This will be imperfect, but it might give me a very general idea. (Better suggestions are welcome.) WhatamIdoing (talk) 07:19, 14 October 2024 (UTC)
The only editor whose userpage is linked from that list and who hasn't edited in 2024 is Geo Swan. 6330 days between first and last edits, about 17 years four months. —Cryptic 14:42, 14 October 2024 (UTC)
The reason for that being that users who haven't edited in a while are normally unlinked. For some mysterious reason Geo Swan is still linked, though. And Geo Swan is a bad example, as they were banned (over a single incident that in my opinion did not warrant a ban) rather than choosing to stop editing. * Pppery * it has begun... 16:31, 14 October 2024 (UTC)
Well, that's inconvenient, though I suppose that it makes it easier to figure out which editors are inactive. 727 out of 1,000 accounts are still linked; therefore, 27% of editors who have made the most edits are inactive. (There are also 11 "placeholders" and a couple of blocked accounts, so ±2%.)
I discovered why those usernames are unlinked independently, after a couple iterations of a query that takes an hour and a half to complete. Très annoying. If you make a user subpage linking to just the unlinked users (and Geo Swan, too, I guess), I can rerun it against that. —Cryptic 22:17, 14 October 2024 (UTC)
Iff that's typical for the lifespan – and it might very well not be, in which case it is almost certainly an underestimate – we may need to double that. I've previously estimated that our current retention rate gets us about enough folks each year to replace 4% of the people who have made 100K edits, or 25 years for full turnover. WhatamIdoing (talk) 21:39, 25 October 2024 (UTC)
List of all talk pages matching "Archives\s*\/\s*\d{1,3}"
Usually archive pages on Wikipedia are of the format "/Archive 1", "/Archive 2",... Often when talk pages are moved, the mover does not update the archiving instructions for the bots. This causes the bot to send sections to archives titled "Archives/ 1", "Archives/ 2", breaking the archiving pages pattern as well as sequence. For example, the last archival before the move might be to "Archive 4". After move, newer sections go to "Archives/ 1". In order to fix them, I would need this query. Thanks! —CX Zoom [he/him] (let's talk • {C•X}) 20:06, 2 November 2024 (UTC)
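A sketch of one way to pull that list (in the database, spaces in titles are stored as underscores, so the broken "Archives/ 1" form appears as "Archives/_1"; the unanchored REGEXP can't use an index, so this scans every talk page and will be slow):
SELECT page_namespace, page_title
FROM page
WHERE MOD(page_namespace, 2) = 1                     -- all talk namespaces
  AND page_title REGEXP 'Archives/_?[0-9]{1,3}$';    -- the _? is the stray space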
While making this edit recently, it occurred to me that we ought to have a way of at least semi-automatically identifying and tagging articles with either a single or no sources. I'd like to be able to do an AWB run of likely such articles.
Given that there are many different ways to do sources, I'd like to start with a conservative query, which lists all articles that contain none of the following strings:
<ref
http://
Notes
cite
Reference
Sources
Citation
Bibliography
sfn
I don't know how to construct a RegEx query with a negative (the internet seems to have some ideas, but I struggle to convert this into Wikipedia's flavor), so I'd appreciate some help. Could anyone help me generate this list? Cheers, Sdkb talk 05:14, 14 November 2024 (UTC)
If someone comes here looking for help with Elasticsearch's middle-end, they're going to be very, very disappointed. —Cryptic 08:13, 14 November 2024 (UTC)
Thanks, @Pppery! After expanding the query to -insource:/([Rr]ef|http|[Nn]otes|[Cc]ite|[Ss]ources|[Cc]itation|[Bb]ibliography|sfn|list of|lists of|link|further reading|Wiktionary redirect)/ -intitle:list -deepcategory:"Set index articles" it's starting to turn up mostly useful results. Cheers, Sdkb talk 07:17, 14 November 2024 (UTC)
You can get more results before it times out by adding more non-regex filters. For instance, adding -hastemplate:"Module:Citation/CS1" gives 15k results instead of just 2k. – SD0001 (talk) 07:39, 14 November 2024 (UTC)
Anyway, the sort of things this page can do to answer your original question are to give you lists of pages with zero, or zero or one, external links, or that don't transclude any of a set of templates, or both; and as a bonus filter out redirects (which I'm fairly sure search does whether you like it or not), disambigs, and - to some extent - list pages. —Cryptic 07:16, 14 November 2024 (UTC)
https://www.w3schools.com/sql/sql_isnull.asp indicates that my syntax should be valid. Two alternative ways to do the same thing? Regarding the "prior noise", I'm a more competent administrator who's checking page histories, and leaving redirects within user space alone. My current focus is on cross-namespace redirects from user pages of nonexistent users to outside of userspace. My recent deletion log will give you an idea; I'm trying to make a more specific query to reduce the noise level in the query results I've been working from. – wbm1058 (talk) 20:53, 14 November 2024 (UTC)
MariaDB supports ISNULL(), and it works the way Wbm1058 was trying to use it (modulo the misplaced space). SQL Server's ISNULL() is a synonym of COALESCE() instead. x IS NULL is generally safer precisely because of that incompatibility. —Cryptic 21:29, 14 November 2024 (UTC)
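A quick illustration of the difference, as it behaves on the replicas (MariaDB):
SELECT ISNULL(user_registration) FROM user WHERE user_id = 1;   -- MariaDB: 1 if NULL, else 0
SELECT user_registration IS NULL FROM user WHERE user_id = 1;   -- portable spelling of the same test
-- SQL Server's two-argument ISNULL(expr, fallback) behaves like COALESCE(expr, fallback) instead.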
I tried just changing the syntax of the "IS NULL" statement as suggested. It was cooking on that for a while, and then:
"Error
This web service cannot be reached. Please contact a maintainer of this project.
Maintainers can find troubleshooting instructions from our documentation on Wikitech."
It just ran to completion, so simply changing the "IS NULL" statement fixed the syntax error. Now on to figure out the results, and tweak the query to do what I really want it to do. Thanks for your help. wbm1058 (talk) 22:09, 14 November 2024 (UTC)
FYI, I'm now feeling the joy. User:Wbm1058/Userpages of nonexistent users is my report of 400 pages which I think may all be safely speedy-deleted under U2: Userpage or subpage of a nonexistent user. This report was culled from a report of 1960 pages, by INTERSECT with the user-table SELECT. This is indicative of the poor page-move interface design: editors who think they're publishing user drafts end up keeping pages in userspace when they really wanted to move them to mainspace, because they neglected the namespace dropdown in the move-page user interface. – wbm1058 (talk) 14:11, 15 November 2024 (UTC)
Dusty articles within all subcategories of a parent category
Is this possible? I'd like to get a list like Special:AncientPages but for anything within any subcategory of Category:Food and drink. It would make quite a nice little To Do list for times I feel like doing some research and writing but don't have a particular bee in my bonnet that very minute. Thanks for any help! Valereee (talk) 15:16, 17 November 2024 (UTC)
Yes, it's a list of articles by date of most recent edit.
Hm, on the second question. Ideally, what I'd end up with is a list of, say, food items that hadn't been edited in ten years. Or chefs, or restaurants, or food animals or whatever. Maybe I need to choose a more specific subcategory? Valereee (talk) 17:24, 17 November 2024 (UTC)
Not wasted at all. Not your fault the category system is terrible for datamining. There might be some value in finding the latest revision that wasn't marked minor, and maybe excluding ones made by bots too, but that's going to be harder and a lot slower. Would definitely need to cut the set of articles to look at to something on the order of a couple thousand before looking at the edits, rather than the tens of thousands in that first tree. —Cryptic 20:14, 17 November 2024 (UTC)
Thanks. And I've actually already found an article that needs attention from your 87976 query, so win!
The point for me here is looking for categories that have many articles that haven't been updated since before sourcing started modernizing. It's a bit tricky because the articles that were created first -- probably in any category -- are also likely the articles that get updated often, have multiple watchers, etc. So it's possible there just aren't huge numbers of food articles that need this kind of attention. Valereee (talk) 21:18, 17 November 2024 (UTC)
Number of articles that are actually articles
There are 7,015,007, but AIUI that includes disambig pages, stand-alone lists, and outlines, and maybe even portals (i.e., all content namespaces, not just the mainspace) but excludes redirects. Is there a way to get a count of the number of just plain old ordinary articles, excluding the other types? (A percentage from a sample set is good enough; I'd like to be able to write a sentence like "Of the 6.9 million articles, 6.2 million are regular articles, 0.45 million are lists, and 0.2 million are disambig pages.") WhatamIdoing (talk) 22:46, 17 November 2024 (UTC)
There is no clear definition of what a "regular article" is. Also, many pages are not correctly marked and categorized. Don't forget WP:INDEXES, which look like ordinary articles, or might be, depending. -- GreenC 00:43, 18 November 2024 (UTC)
{{NUMBEROFARTICLES}} seems to be mainspace non-redirect pages. I'd thought it used other heuristics, too; I remember needing at least one link, and less confidently requiring a period? But plainly it doesn't anymore; I'm getting 6912240 for ns0 !page_is_redirect on the replicas now. There's only 362201 non-redirects in Category:All disambiguation pages and mainspace. Most of the difference is in other namespaces, probably legitimately, though I'm surprised to see 208 in user:, 44 total in various talk namespaces, 9 mainspace redirects, and a single redirect in draftspace. 114253 mainspace non-redirects in Category:All set index articles, though 64 of those are in the disambig cat as well. Lists are less certain; there's no Category:All list pages. I could try to count pages that are in any category starting with "Lists " or ending with " lists", but - not having done that before - don't have any idea how many it would miss, and how many it would miscount. Ditto with pages starting with "List of " or "Lists of " (which is easy - 120653, not counting any redirs or pages in the dabs or set index cats). —Cryptic 01:00, 18 November 2024 (UTC)
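The counts above come from queries of roughly this shape (a sketch; note cl_to stores underscores rather than spaces):
-- Mainspace non-redirects, the {{NUMBEROFARTICLES}}-style figure:
SELECT COUNT(*)
FROM page
WHERE page_namespace = 0 AND page_is_redirect = 0;
-- Of those, the disambiguation pages:
SELECT COUNT(*)
FROM page
JOIN categorylinks ON cl_from = page_id
WHERE page_namespace = 0 AND page_is_redirect = 0
  AND cl_to = 'All_disambiguation_pages';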
So 6,912,240 non-redirect pages, of which 362,201 are dab pages and 120,653 are Lists (per title), and the rest (e.g., Outlines, Index pages) is immaterial. A good SIA looks like an article and an underdeveloped one looks like a dab page, which takes us back to GreenC's point about it not being a clear-cut question.
All of this suggests that if you count SIAs as 'articles', then there are 6.429 million articles (93%) and if you count them as lists/dabs, then there are 6.315 million articles (91%).
Today's question is how old the typical currently active WP:EXTCONFIRMED account is. The requirements are:
is currently active (perhaps made at least one edit during the last 30 days? Any plausible definition of active works for me, so pick whatever is easiest and cheapest to run)
meets EXTCON (all of which will have the EXTCON permission)
I don't care whether it's date of first edit vs registration date. I also don't care whether it's all ~73K of them or if it's a sample of 5K–10K. I am looking for an answer to the nearest year ("Most active EXTCON editors started editing before 2014" or "If you see someone editing an article under EXTCON, they probably started editing more than 10 years ago").
Hmm. user_touched has been scrubbed because it is private data. So I guess LEFT JOIN recentchanges to see who is active? This should only get us users who have made an edit in the last 30 days. Then run MEDIAN() on it. Let's see if quarry:query/88037 finishes. If the count is 72,000ish, then I also need to add a WHERE to filter out the editors who aren't in recentchanges. –Novem Linguae (talk) 18:33, 19 November 2024 (UTC)
That's going to get you not just every user with the user right - the whole point of a left join is that you get a result whether there's a row in the joined table or not - but a row in your resultset for every row they have in recentchanges. And you're leaving out admins, who have extended confirmed implicitly. Plus, even if it worked, it would be a dump of ~25k values. MariaDB has a MEDIAN() window function, but I can't get it to work on user_registration no matter how I try to preprocess it first - it gives me "Numeric datatype is required for percentile_cont function" when I call it directly on the column, which is reasonable, but always 100000000 if I cast it to any kind of numeric value, which isn't. (Anyone know what I'm doing wrong? Sample query. And I've never really grokked window funcs or how to get them to behave like normal, sane, grouped-by aggregate funcs anyway.) But doing it longhand works just fine. quarry:query/88039: 28 May 2013. —Cryptic 19:37, 19 November 2024 (UTC)
user_registration is ASCII-encoded binary rather than a binary number which is why you're getting nonsense when casting it and trying to do operations on it. Uhai (talk) 21:28, 19 November 2024 (UTC)
Casting it seems to get me a numeric, and doing normal arithmetic on it (user_registration + 1, division, and so on) coerces it to a numeric; it doesn't get me nonsense until I try to pass it through MEDIAN(). And UNIX_TIMESTAMP() in particular is documented to return an unsigned int. Current theory is that MEDIAN() can't deal with large numbers (see resultset 4; dividing by numbers smaller than ten gets me 100 million again), which is boggling. No, a cast or operation on the result of MEDIAN() is what fixes it. Still boggling. Cleaner query. Thanks for the prod. —Cryptic 21:55, 19 November 2024 (UTC)
Also, comparing the results reminded me that user_registration is NULL for some users who registered before mid-2005ish, which I hadn't corrected for. 2013-06-15 19:42:14, though I doubt the two weeks' inaccuracy is going to matter much to WAID. —Cryptic 22:30, 19 November 2024 (UTC)
TIL CAST seems to convert to the proper numeric representation if the binary string contains only numeric ASCII characters. Glad you were able to get it working though. Uhai (talk) 22:34, 19 November 2024 (UTC)
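Putting the thread together, a minimal working shape for the median-registration query (a sketch: MEDIAN() is a window function in MariaDB, the trailing + 0 is the "operation on the result" that avoids the bogus 100000000, and NULL pre-2005 registrations are filtered out per the correction above; admins, who hold extendedconfirmed implicitly, would still need adding separately):
SELECT DISTINCT
       FROM_UNIXTIME(MEDIAN(UNIX_TIMESTAMP(user_registration)) OVER () + 0) AS median_registration
FROM user_groups
JOIN user ON user_id = ug_user
WHERE ug_group = 'extendedconfirmed'
  AND user_registration IS NOT NULL;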
Hi everyone. Cryptic kindly created this query which shows how many draftifications took place between 2021-07 and 2022-08. Could someone please fork it to show dates from 2016 to 2024? If it's easier, I'm fine with seeing the number of draftifications by year instead of by month. Many thanks and best wishes, Clayoquot (talk | contribs) 03:38, 14 December 2024 (UTC)
Measuring the number of source links to each domain for a given article/set of articles
Command denied
I keep getting the error, "execute command denied to user 's52788'@'%' for routine 'enwiki_p.count'". I was using the page database, but even after I modified my query to only use the externallinks database (meaning I need to input a numerical page ID instead of using the title), I'm still getting the denial. What am I doing wrong here? Am I just not allowed to aggregate? Here's my query, simplified as much as possible and still not working:
SELECT count (el_to_domain_index)
FROM externallinks
WHERE el_from = 37198628
GROUP BY el_to_domain_index;
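For what it's worth, the likely culprit is the space between count and the opening parenthesis: with that space, MariaDB parses count (...) as a call to a stored routine named count rather than the built-in aggregate, which is exactly the routine-privilege error quoted above. Without the space it runs:
SELECT COUNT(el_to_domain_index)    -- no space before the parenthesis
FROM externallinks
WHERE el_from = 37198628
GROUP BY el_to_domain_index;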
Now I'm trying to get counts for all the external links from all the pages in a category. I want to do this for each separate page, and get lists of all the actual URLs, but y'know, baby steps. I used this query: https://quarry.wmcloud.org/query/89031
USE enwiki_p;
SELECT el_to_domain_index,
count(el_to_domain_index)
FROM externallinks
JOIN categorylinks ON cl_from = el_from
WHERE cl_to = 11696843
GROUP BY el_to_domain_index
ORDER BY count(el_to_domain_index) DESC;
I'm not getting any results and it takes ages to not get them. What am I doing wrong now? Also, how do I include pages in any subcategories, or does this include them automatically? Safrolic (talk) 00:57, 22 December 2024 (UTC)
I figured out that I need to use page despite the slowness it'll cause, because cl_to uses a name instead of an ID. So here is my new query, now also running on simplewiki for easier testing. https://quarry.wmcloud.org/query/89032
USE simplewiki_p
SELECT page_title,
el_to_domain_index,
count(el_to_domain_index)
FROM externallinks
JOIN categorylinks ON cl_from = el_from
JOIN page on cl_from = page_id
WHERE cl_to = Canada
GROUP BY page_title, el_to_domain_index;
This query though has a syntax error on line 2.
I also think I might be in the wrong place to ask for step-by-step help like this. If there's a better place for me to go, I'd appreciate the direction. Safrolic (talk) 02:18, 22 December 2024 (UTC)
You don't need the USE statement on Quarry since you have to select a database there separately (since most are on different servers now); but if you keep it, you need to terminate it with a semicolon. Next error you'd get is that you need to quote 'Canada'. At least that one has a useful error message ("Unknown column 'Canada' in 'where clause'"). The reason your first query took forever is because SELECT * FROM categorylinks WHERE cl_to = 11696843; does a full table scan - it tries to coerce each row's cl_to (a string value) into a number, and then does a numeric comparison. There's no correct way to use the index on cl_to since many different strings compare equal to that number, in particular ones starting with whitespace. SELECT * FROM categorylinks WHERE cl_to = '11696843'; on the other hand finishes instantly with no results (since Category:11696843 has no members). Categories are only loosely tied to the page at their title anyway. You won't get members of subcategories like that - you have to go far out of your way to do so, similar to quarry:query/87975. You would get the direct subcategories like simple:Category:Canada stubs themselves, if any happened to have any external links. Distinguish them by selecting page_namespace too, if you're not already filtering by it. —Cryptic 02:56, 22 December 2024 (UTC)
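Folding those fixes in (semicolon-terminated USE, quoted string, page_namespace selected to tell articles from any subcategory pages), a corrected version of the query would be:
USE simplewiki_p;                       -- optional on Quarry; note the semicolon
SELECT page_namespace,
       page_title,
       el_to_domain_index,
       COUNT(*) AS links
FROM externallinks
JOIN categorylinks ON cl_from = el_from
JOIN page ON page_id = cl_from
WHERE cl_to = 'Canada'
GROUP BY page_namespace, page_title, el_to_domain_index;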
It sounds like I'm better off doing a multipart kludge- getting all the relevant page titles with Massviews or Petscan, running a single query to turn them into IDs, then using those IDs as el_froms so I only need the externallinks database. Thank you for your help! Safrolic (talk) 05:59, 22 December 2024 (UTC)
Orphaned editnotices
When a page is moved, its editnotice is not moved with it. There is a post-move warning for it, but users would need to move it separately. That too can only be done by template editors, page movers and admins. I believe that there are plenty of editnotices that have become orphaned from their target page. I need a query to list such pages. If there is already a regularly updated database, that will work too. Thanks! —CX Zoom [he/him] (let's talk • {C•X}) 07:53, 25 December 2024 (UTC)
I suspect it's much worse than that. It's certainly more complex. There's plenty of mainspace titles with colons in them, and it's conceivable that some of those have orphaned editnotices; there's really no way around parsing for the namespace name, and that's going to be ugly and complex, and I haven't tried it yet. (It being Christmas morning and all. Maybe tomorrow.) But I wouldn't estimate that to result in more than a handful of other hits. Much more likely is the case that CX Zoom mentions directly: a page is moved but the editnotice isn't, leaving it attached to the remnant redirect. There's going to be false positives looking for those whether we do it the "correct" way and look in the move log (since there might be an editnotice intentionally attached to a page that had another page moved from it in the past), or whether we include editnotices attached to pages that are currently redirects. The latter's easier, and especially easier to combine with the query looking for pages that don't exist; I've done it at quarry:query/89148. That'll also miss editnotices that unintentionally weren't moved with their page, where the resulting redirect was turned into a different page, though. —Cryptic 15:30, 25 December 2024 (UTC)
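For the simple (mainspace-only) half, a sketch: editnotices live at Template:Editnotices/Page/<target title>, and 'Editnotices/Page/' is 17 characters, so the target lookup is just a string operation. The colon filter below is the crude part - as noted above, mainspace titles can themselves contain colons, and handling those properly needs real namespace parsing:
SELECT en.page_title
FROM page AS en
LEFT JOIN page AS target
       ON target.page_namespace = 0
      AND target.page_title = SUBSTRING(en.page_title, 18)
WHERE en.page_namespace = 10                             -- Template:
  AND en.page_title LIKE 'Editnotices/Page/%'
  AND en.page_title NOT LIKE 'Editnotices/Page/%:%'      -- crude mainspace-only filter
  AND (target.page_id IS NULL OR target.page_is_redirect = 1);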
In my travels I have come across some very long ref names in ref tags, sometimes automatically generated by incorporating the title of the work that is being referenced. Occasionally I will shorten excessively long ref names just to improve readability of the wikitext in that section. This has me curious as to whether it is possible to generate a list of the longest ref names being used in articles, as there are probably some likely targets for this kind of cleanup. Is it possible to either generate a list of the longest ref names in order of length, or barring that, a list of ref names that are more than, say, 50, or perhaps 75, characters? BD2412 T 02:49, 29 December 2024 (UTC)
References aren't in the database except in page source, which isn't available in the replicas. You can find some with search, like insource:ref insource:/\< *ref *name *= *[^ <>][^<>]{76}/. That's going to miss a bunch of them, most obviously any refs crazy enough to include angle brackets in their names (though those mostly seem to be errors anyway) or ref syntax using non-standard spaces, but ElasticSearch's gratuitously awful regex engine can't do a whole lot better. Also won't find ref names populated through templates - I understand some infoboxes do this. —Cryptic 05:26, 29 December 2024 (UTC)
This is definitely plenty to start with. It is crazy that there are so many lengthy ref tags. The Wikipedia article with the most references has under 1000 of them, and if every ref name were an arbitrary combination of letters and numbers, they could all be handled with two-character ref names. BD2412 T 18:44, 29 December 2024 (UTC)
Australia Project
I am interested to know how the Australian project is progressing.
Number of
articles created
articles deleted
edits by editors with/without Australia user boxes.
Which specific user boxes? What time frame? Articles those users have deleted, or articles those users created that anyone deleted? Counting edits by editors without specific user boxes is Right Out; it's going to be well over a billion, would take days to count if the query didn't die (it would), and would be useless for any purpose. —Cryptic 16:46, 2 January 2025 (UTC)
I bring https://quarry.wmcloud.org/query/89411 back to haunt you. I'm trying to assemble a table of registered editors per year. I have figured out how to modify the query to pick a different year. But what I want now is the number of registered editors in each year who made 10+ edits during that year (so, 10 edits in 2024 counts, but 5 edits in 2023 plus another 5 edits in 2024 does not), 100+ edits, and 1,000+ edits.
I can't think of a way to do this that doesn't look at every edit in the time range. I think it's likely the query will die. But then, it managed to complete for one month (January 2024), so maybe quarry:query/89557 for all of 2024 will eventually too. No counts of currently-deleted edits this time. —Cryptic 12:04, 5 January 2025 (UTC)
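The shape of what it has to do, for reference (a sketch; restricted to registered accounts via the actor table, and - as noted - liable to die because it scans every 2024 revision):
SELECT SUM(edit_count >= 10)   AS editors_10_plus,
       SUM(edit_count >= 100)  AS editors_100_plus,
       SUM(edit_count >= 1000) AS editors_1000_plus
FROM (SELECT rev_actor, COUNT(*) AS edit_count
      FROM revision
      JOIN actor ON actor_id = rev_actor
      WHERE actor_user IS NOT NULL            -- registered accounts only
        AND rev_timestamp LIKE '2024%'
      GROUP BY rev_actor) AS per_editor;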
Maybe this is something that would have to be done by the WMF's Analytics team. I can get "one edit this year" from the original query that you wrote for me, and I'm currently slowly walking it back. It takes ~40 minutes to run, plus several hours for me to remember to check it.
Better to fork it. The time for deleted edits isn't going to change much - it has to do a full table scan, since there isn't an appropriate index - but it's still the live edits that take the bulk of the time, and that is improved by narrowing the timespan looked at. —Cryptic 05:13, 6 January 2025 (UTC)
Okay. I forked it to quarry:query/89581, changed the years from 2024 to 2023, and set it to run again. If this works, then I can repeat that step a dozen times.
That's expected, regardless of which earlier query you mean. Query 89557 will have fewer than 89569 because there's plenty of users who have deleted edits in 2024 but no currently-live ones. One based on 89411 will have very slightly fewer because the views of the revision and archive tables it's looking at are slightly more heavily redacted than the ones 89557/89569 use. —Cryptic 05:35, 6 January 2025 (UTC)
89411 tells me there were 812,635 editors in 2023.
89581 (forked from your new 89569) tells me there were 11 fewer editors in 2023.
But the same scripts for 2024 vary in the opposite direction. The second script finds 29 more editors in 2024. Mostly the second script seems to be finding a small handful more editors (2 to 50) in each year. WhatamIdoing (talk) 02:46, 7 January 2025 (UTC)
I don't see a run in 89411's history for 2023. quarry:query/80211 shows that figure, but it was run almost a year ago. I've just rerun it, and it's giving me 812621, which is both more consistent with the 812624 from the new query and offset in the right direction. Or less wrong direction, anyway. I'm reasonably confident that the reason we're getting fewer numbers from the same queries now compared to a year ago is because there's been more revdeletions and suppressions in the meantime; neither query can see such edits. It doesn't surprise me a bit to find out there's been revdeletions of 2023 edits made after early February 2024. I'll run a comparison of users that would be seen by the two queries so we can have a better idea why they show up in the second and not the first; it's going to take a while though. —Cryptic 04:25, 7 January 2025 (UTC)
(Which, of course, isn't that helpful an answer, so I've updated the query in-place to exclude talk pages that are redirects. —Cryptic 17:43, 15 January 2025 (UTC))
Serial commas in page titles
I posted the following request at WP:BOTR and was advised to come here.
Extended content
Hello, I'm not sure that this request can be completed automatically; please accept my apology if it can't. I just want some lists, without edits to anything except the page where you put the lists, so it's not a CONTEXTBOT issue: just a "good use of time" issue. Could you compile some lists of pages in which serial commas are present or are omitted? I just discovered List of cities, towns and villages in Cyprus and created List of cities, towns, and villages in Cyprus as a redirect to support serial commas. Ideally, whenever a page could have a serial comma in the title, we'd have a redirect for the form not used by the current title, but I assume this isn't always the case.
First off, I'd like a list of all mainspace pages (whether articles, lists, disambiguation pages, anything else except redirects) that use a serial comma. I think the criteria might be:
[one or more words]
comma
[one or more words]
comma
["and" or "or"]
[one or more words]
I'm unsure whether they're rigid enough, or whether they might return a lot of false positives.
Secondly, I'd like a list of all pages whose titles are identical to the first list, except lacking a serial comma. Redirects would be acceptable here, since if I'm creating serial-comma redirects, it helps to know if it already exists.
Thirdly, I'd like a list of all mainspace pages (whether articles, lists, disambiguation pages, anything else except redirects) that could use a serial comma but don't. I think the criteria would be:
[Page is not on first or second list]
[one or more words]
comma
[one or more words]
["and" or "or", but no comma immediately beforehand]
[one or more words]
Once the list is complete, the bot checks each page with the following process: "if I inserted a comma immediately before 'and' or 'or', would it appear on the first list?" If the answer is "no", the bot removes it from the list.
Fourthly, I'd like a list of all pages whose titles are identical to the third list, except they have a serial comma. Again, redirects are acceptable.
Is this a reasonable request? Please let me know if it's not, so I don't waste your time.
After my request, User:Qwerfjkl suggested that I come here and offered some comments: intitle:/[A-Za-z ]+, [A-Za-z ]+, (and|or) [A-Za-z ]+/ would work for the first request and intitle:/[A-Za-z ]+, [A-Za-z ]+ (and|or) [A-Za-z ]+/ would work for the second. The latter two lists are trickier. To this, I replied Is there a way to download a list of results from a particular search? As far as I know, the only way to get a list of results is to copy/paste the whole thing somewhere and delete everything that's not a page title. (With 11,544 results for the first search, this isn't something I want to do manually.) Also, the first search includes redirects, e.g. Orders, decorations, and medals of the United Nations is result #1.
Please ping me if you reply; I've never heard of this page before, and I might not remember to come back if you don't ping me. Thanks for your help. Nyttend (talk) 03:22, 20 January 2025 (UTC)
@Nyttend: Doing the whole thing would be very awkward; we don't have access to create temporary tables on the replicas, so it would have to all be in one query, and I haven't been able to cajole the query pessimizer into doing it efficiently. What I can do quite easily is give you lists of all mainspace titles that match each of those regexes, along with whether each is a redirect, and what the titles would be if the comma is added/removed (as appropriate). quarry:query/90019 and 90020. Those should be sufficient to construct your four lists, though perhaps not trivially. I can do that too if you're unable, though not tonight and maybe not tomorrow either; let me know. —Cryptic 04:32, 20 January 2025 (UTC)
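For reference, the title-matching half of that is a REGEXP over page (a sketch; page_title stores spaces as underscores, hence the _ in the character classes, and dropping the comma before (and|or) gives the non-serial-comma list):
SELECT page_title, page_is_redirect
FROM page
WHERE page_namespace = 0
  AND page_title REGEXP '^[A-Za-z_]+,_[A-Za-z_]+,_(and|or)_[A-Za-z_]+$';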
Could you give me two tables with four columns of data? First column is the link, second column is existing status (red, blue redirect, blue non-redirect), third column is alternate-comma variant (with-comma for the without-comma table; without-comma for the with-comma table), and fourth column is status of variant. If you do this in a simple table, I should be able to copy/paste it into Excel, sort so I can easily remove everything I don't want, and move the remaining content back to MediaWiki. Nyttend (talk) 04:44, 20 January 2025 (UTC)
Ah, never mind, I've explored the links and just realised that there's a way to download a CSV with almost all this information. This should suffice. Thanks! Nyttend (talk) 04:47, 20 January 2025 (UTC)
There's no way to see whether they have a parameter. The usual solution is to add a hidden tracking category in the template. —Cryptic 18:51, 24 January 2025 (UTC)
References aren't stored in the database at all except in the page text, which isn't copied to the toolforge replicas. —Cryptic 14:37, 24 January 2025 (UTC)
Hmm. How is the no-refs query made? Is there any way to change it from the existing "no refs exist" check to "x number of refs exist"? From my very novice understanding of the existing query, it checks the number of external links; can we modify it to find a specific number of them for the requested query? Thanks for your help. ~/Bunnypranav:<ping> 16:32, 24 January 2025 (UTC)
It's looking at whether there are any external links, yes, and I'll go ahead and change the query. But it's going to get hardly any results either way - the only pages it finds even without limiting external links at all are Battle of Nalos (3 links) and Operation Qazanchi (6), I suspect by its requirement of a no-citations tag, and one specifically added through page-triage at that. —Cryptic 16:50, 24 January 2025 (UTC)
Thanks Cryptic for that change, that will help catch some more possibilities eventually. @DreamRimmer, since I see you have expertise in Python scripts, could a script help find the number of citation tags for pages younger than 90 days and list them somewhere? ~/Bunnypranav:<ping> 06:28, 25 January 2025 (UTC)
I would say this is technically doable, but it's not advisable due to the large number of pages involved, roughly 50,000 to 60,000 pages created in the last 90 days. Checking the number of references for these pages would require making the same number of API requests, which would significantly waste resources. A more efficient approach would be to download the relevant database dumps and generate your list from that data. – DreamRimmer (talk) 07:40, 25 January 2025 (UTC)
Is there any way to limit the number of pages queried from the beginning? In any case, I am ready to at least try to process the data dumps from my end; could you possibly help me with a Python script, since I do not know enough to start from scratch? Up to you though, and thanks for the reply! ~/Bunnypranav:<ping> 07:43, 25 January 2025 (UTC)
@Novem Linguae I had to use my staff account to run queries against MediaWiki history, so I can't link the queries directly, and these should be taken with "Sam isn't an expert and might have missed something" caveats, but I found the following:
In 2024, 678,142 pages were created in the article namespace by users (not including bots). Of those, 170,684 are not currently redirects (pages may have changed since initial creation, so this isn't precisely a count of how many non-redirects were created). The rest of the numbers cover only those non-redirects and are percentages of that figure; they likewise exclude bots. (A sketch of this kind of tally on the replicas follows the list.)
101,987 (60%) pages were created by non-autopatrolled, non-admin users.
756 (0.4%) pages were created by non-autopatrolled, admin users.
67,941 (40%) pages were created by autopatrolled users.
77,412 (45%) pages were created by non-autopatrolled editors with fewer than 10,000 edits.
25,331 (15%) pages were created by non-autopatrolled editors with at least 10,000 edits.
100,929 (59%) pages were created by non-autopatrolled editors who do not have the patroller user group.
1,814 (1%) pages were created by non-autopatrolled editors who do have the patroller user group.
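For anyone wanting to reproduce this on the replicas rather than the MediaWiki history dataset, the shape of the query is roughly the following minimal sketch. It is only an approximation of the numbers above: it sees only surviving pages, current group membership, and current redirect status, and bot exclusion is left out. (Enwiki's autopatrolled group is internally named 'autoreviewer'.)

```sql
-- Minimal sketch: 2024 mainspace creations by registered users, split by
-- whether the creator currently holds the autopatrolled right.
SELECT SUM(ug.ug_group IS NOT NULL) AS by_autopatrolled,
       SUM(ug.ug_group IS NULL)     AS by_non_autopatrolled
FROM revision r
JOIN page p  ON p.page_id = r.rev_page
           AND p.page_namespace = 0
           AND p.page_is_redirect = 0
JOIN actor a ON a.actor_id = r.rev_actor
LEFT JOIN user_groups ug
  ON ug.ug_user = a.actor_user
 AND ug.ug_group = 'autoreviewer'  -- enwiki's internal name for autopatrolled
WHERE r.rev_parent_id = 0          -- first revision = page creation
  AND a.actor_user IS NOT NULL     -- registered accounts only
  AND r.rev_timestamp LIKE '2024%';
```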
Public version at quarry:query/90509 (history at quarry:history/90509/976754/947548). Can't figure out why my numbers disagree - I'm seeing about twice as many pages total - even considering I included pages created by non-users (either ips, or where the first revision had the user revdelled/suppressed, or conceivably bad imports). Percentages excluding those are broadly similar though. —Cryptic00:11, 6 February 2025 (UTC)
Thank you both for your hard work. This data makes it clear that advocating for something like autopatrolling admins would have a negligible effect on reducing the WP:NPP queue (0.4% reduction). This is good data that will help inform some NPP-related decisions. –Novem Linguae (talk) 00:22, 6 February 2025 (UTC)
Frown. Now I'm seeing fewer pages than you - 195,800 total currently-non-redirects, 107,941 currently-non-redirects by users. And that's without excluding bots. Are you including currently-deleted pages? —Cryptic 00:25, 6 February 2025 (UTC) Never mind, I see what I did, duh. quarry:query/90509 again, now with much-less-disagreeable numbers - all somewhat higher than Sam's, since it still includes bots. Interesting that the percentages turned out close to right even though I was looking at the wrong (essentially random) users' groups. —Cryptic00:46, 6 February 2025 (UTC)
I am actually kind of curious as to why an admin would ever not be autopatrolled. Are there admins who make poor quality new articles? BD2412T02:46, 6 February 2025 (UTC)
@Cryptic Good to hear we're on the same track :) MediaWiki history includes since-deleted pages, so that will be another minor source of disagreement. Sam Walton (talk) 09:03, 6 February 2025 (UTC)
Then my numbers should be lower than yours, instead of ranging from 10-30% higher across the board. Including bots isn't it; there were only four bot "creations" in my dataset, Kansas City shooting, Pedro Campos (disambiguation), 10 Lives (disambiguation), and Chris Sheppard (disambiguation), all by User:RussBot; and two are bad data anyway, showing up only because they were history merged. (History merges, despite being half of that very small sample, aren't going to come anywhere close to accounting for the difference. Even considering that the other two of these pages were cut-and-paste moves and could stand to have their histories fiddled with too.) Is your data source looking at where pages were initially created, or where they are now? —Cryptic11:25, 6 February 2025 (UTC)
@Cryptic Where they were initially created. The only caveat is the redirect issue I noted - the MW History dataset can say whether a page is currently a redirect, but doesn't store data on whether it was a redirect at the time of the edit. That would mean I'm undercounting based on pages which were created and then turned into a redirect some time between then and now, which might account for the difference? Sam Walton (talk) 15:37, 6 February 2025 (UTC)
No, my query also only sees the current redirect status. The difference is that mine also uses the current page location; while there's going to be some pages created in mainspace and moved to draft, there'll be many, many more created in draft or userspace first. These are probably the more relevant numbers, since those need to be patrolled too. I suppose I could add in deleted pages that were last in mainspace, but it's going to be slow - there's no usable index on timestamp, and the one on namespace isn't very selective - and there's no way to see redirect status. —Cryptic20:16, 6 February 2025 (UTC)
Again about New Pages
Following a January backlog drive, and for further research into improving NPP, would it be possible to generate the following queries over a one-month sample, 1 to 31 January 2025?
Number and % of new articles created in the article space by non-autopatrolled users, excluding redirects and dab pages
Number and % of new articles created by accounts that had existed for less than 30 days and made fewer than 500 edits
Number and % of new articles excluding redirects and dab pages, that were still unpatrolled after the sample period
Number and % of new articles excluding redirects and dab pages, that got deleted within the sample period.
Number and % of new articles that were recreated from previously deleted titles.
and if possible, Number and % of new articles that were created from redirects.
#4 (I've converted your requests from bullet points), at least, is going to be a problem - we can only see whether a currently-existing page is currently a redirect or dab. There's no access to the wikitext of either existing or deleted pages; for existing ones, there's a field that says whether it's currently a redirect, and dabs can be detected either through their categories or from a page property that's set on disambigs proper (though not on set indices), but none of those exist for deleted ones.
The queries that don't mention excluding redirects and dabs - #2, #5, and #6 - was this omission deliberate or not?
For #2, I can compare article creation time directly to its creator's registration time, to see if the user had registered within 30 days of creation; but checking editcount can only reasonably look at the current editcount. Given the latter, should I look at current account age for consistency, or account age at creation for the partial gain in accuracy? (The former's a lot easier for these particular numbers, since they happen to match up with the extendedconfirmed group.) —Cryptic00:20, 7 February 2025 (UTC)
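To make the registration-time comparison concrete, here is a minimal sketch of the #2 check (with, for #1, the redirect/dab exclusion via the page property mentioned above), assuming the enwiki_p replica schema; editcount and deletion checks are left out:

```sql
-- Minimal sketch: January 2025 mainspace creations, excluding current
-- redirects and disambiguation pages, where the creator had registered
-- within 30 days before creating the article.
SELECT p.page_title
FROM page p
JOIN revision r ON r.rev_page = p.page_id AND r.rev_parent_id = 0
JOIN actor a    ON a.actor_id = r.rev_actor
JOIN user u     ON u.user_id = a.actor_user
LEFT JOIN page_props pp
  ON pp.pp_page = p.page_id
 AND pp.pp_propname = 'disambiguation'   -- set on disambigs proper
WHERE p.page_namespace = 0
  AND p.page_is_redirect = 0
  AND pp.pp_page IS NULL
  AND r.rev_timestamp BETWEEN '20250101000000' AND '20250131235959'
  AND u.user_registration IS NOT NULL
  AND r.rev_timestamp <= DATE_FORMAT(
        STR_TO_DATE(u.user_registration, '%Y%m%d%H%i%s') + INTERVAL 30 DAY,
        '%Y%m%d%H%i%s');
```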
"The queries that don't mention excluding redirects and dabs - #2, #5, and #6 - was this omission deliberate or not?" That is correct; it was deliberate. For #4, just the number and % of new mainspace articles - any kind - that were deleted, or flagged for CSD or AFD, would be fine. For #2, whatever is easiest for you. I'm looking to use this to extrapolate a hypothesis of what it would have looked like if mainspace articles were only created by ExtendedConfirmed editors. Thanks. Kudpung กุดผึ้ง (talk) 11:43, 8 February 2025 (UTC)
What I would like is a way to match DOIs with those journals. For example, the journal Advanced Electronic Materials has a list of issues here: https://advanced.onlinelibrary.wiley.com/loi/2199160x. Sampling a few years of articles reveals that all such articles have DOIs that start with "10.1002/aelm".
I don't know if there's a way to query the Wiley database for this, or if we can use Wikipedia (or its dumps) to find our own matches in existing articles.
But the general desire is that, for all Wiley journals in that Excel sheet, we compile a list of DOIs and find whether there are matching initial patterns.
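There's no way to query Wiley's own database from the replicas, but DOI links already present in articles do show up in the externallinks table. A minimal sketch, assuming the current split-domain externallinks schema (where el_to_domain_index stores the protocol plus reversed host, e.g. 'https://org.doi.'), and using 10.1002/aelm purely as the example prefix from above:

```sql
-- Minimal sketch: how many mainspace external links point at DOIs with
-- the example Wiley prefix 10.1002/aelm.
SELECT COUNT(*) AS doi_links
FROM externallinks el
JOIN page p ON p.page_id = el.el_from
WHERE p.page_namespace = 0
  AND el.el_to_domain_index LIKE 'https://org.doi.%'
  AND el.el_to_path LIKE '/10.1002/aelm%';
```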
I would be curious to see a list of userspace subpages that have gone the longest without being touched. I suspect there are tons and tons of abandoned pieces of articles by long-gone users. BD2412T19:30, 14 March 2025 (UTC)
Now I am wondering if this can be correlated against the length of time since the user's last edit, so we can see which are subpages from long-gone editors. BD2412T22:08, 14 March 2025 (UTC)
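A minimal sketch of the first half, assuming the enwiki_p schema; joining the last edit through page_latest avoids scanning every revision. Correlating with the owner's last edit would additionally need the user and actor tables:

```sql
-- Minimal sketch: userspace subpages ordered by how long since their
-- last edit (oldest first).
SELECT p.page_title,
       r.rev_timestamp AS last_edited
FROM page p
JOIN revision r ON r.rev_id = p.page_latest
WHERE p.page_namespace = 2
  AND p.page_title LIKE '%/%'   -- subpages only
ORDER BY r.rev_timestamp ASC
LIMIT 100;
```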
... both of which are alt accounts of admins. More striking than the no-edits-ever cases is that the otherwise-most-inactive autopatrolled user is Nichalp, who somewhat infamously lost most of their permissions for undisclosed paid editing. * Pppery *it has begun...15:38, 28 March 2025 (UTC)
Yeah, that stuck out at me too. Also that they got autopatrolled and pcr almost a year and a half after their last edit, and almost exactly a year after all their permissions were revoked. —Cryptic17:06, 28 March 2025 (UTC)
Unusually large redirect talk archives
I recently found out that when pages and their archives are moved but the archive template is not updated, ClueBot III keeps adding talk sections below the old location of the archive (which is now a redirect). See Special:History/Talk:2024 Kolkata rape and murder incident/Archive 1. I need a list of talk archives that are redirects but are also unusually large (>250 bytes), ordered by page size in descending order. Thanks! —CX Zoom[he/him](let's talk • {C•X})16:15, 2 April 2025 (UTC)
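A minimal sketch of the request, assuming the enwiki_p schema; a genuine redirect page is normally well under 250 bytes, so anything larger is suspect. Only the main Talk: namespace (1) is covered here; other talk namespaces would need their numbers added:

```sql
-- Minimal sketch: Talk: archive subpages that are redirects yet larger
-- than 250 bytes, biggest first.
SELECT page_title, page_len
FROM page
WHERE page_namespace = 1
  AND page_is_redirect = 1
  AND page_title LIKE '%/Archive%'
  AND page_len > 250
ORDER BY page_len DESC;
```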
Articles with a particular template missing a data element from that template
How would I create a query that finds all articles with a template (e.g. "Infobox university") that are missing a particular data element in that template (e.g. "logo")? A couple of scenarios I'm thinking of:
In practical terms, you can't - template parameters aren't exposed to the database except as part of the wikitext, so you'd have to parse that yourself; and the text isn't copied to the database replicas, so you'd have to download a dump to do even that. Or grab the pages' text from the API, if you can narrow the set enough - Special:Whatlinkshere/Template:Infobox university has close to 27000 pages on it, which is about two orders of magnitude too many for that approach to work well. (Top of this page: Quarry does not have access to page content, so queries which require checking wikitext cannot be answered with database queries. In particular, there is no way to query for template parameters or anything related to references.) The usual solution is to change the template to add a tracking category. —Cryptic17:51, 2 April 2025 (UTC)
Thanks @Cryptic! Can you tell me more about how the solution with adding a tracking category would work (or point me to an example)? I'm good with rolling up my sleeves to find a way to do this. 😀 ⚙️ WidgetKid 🙈🙉🙊 19:37, 2 April 2025 (UTC)
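Roughly: the template is edited so that, when the parameter is empty, it emits a hidden category - for Infobox university, something like {{#if:{{{logo|}}}||[[Category:Pages using infobox university without a logo]]}} (that category name is hypothetical). Once the job queue has reparsed the transclusions, the category can be read straight off the replicas. A minimal sketch, assuming the classic categorylinks schema with cl_to:

```sql
-- Minimal sketch: articles in the hypothetical tracking category that
-- the template would add when |logo= is empty.
SELECT p.page_title
FROM page p
JOIN categorylinks cl ON cl.cl_from = p.page_id
WHERE p.page_namespace = 0
  AND cl.cl_to = 'Pages_using_infobox_university_without_a_logo';
```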