Talk:Transformer (deep learning architecture) Source: en.wikipedia.org/wiki/Talk:Transformer_(deep_learning_architecture)

Transformer (deep learning architecture) was nominated as an Engineering and technology good article, but it did not meet the good article criteria at the time (August 12, 2024, reviewed version). There are suggestions on the review page for improving the article. If you can improve it, please do; it may then be renominated.

Linguistics Mid‑importance

This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Mid-importance on the project's importance scale.

Computing Mid‑importance

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Mid-importance on the project's importance scale.

Artificial Intelligence

This article is within the scope of WikiProject Artificial Intelligence, a collaborative effort to improve the coverage of Artificial intelligence on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

Text and/or other creative content from this version of Large language model was copied or moved into Transformer (machine learning model) with this edit on 2 August 2023. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted as long as the latter page exists.

On 19 June 2025, it was proposed that this article be moved to Transformer model. The result of the discussion was not moved.

Wiki Education Foundation-supported course assignment

This article was the subject of a Wiki Education Foundation-supported course assignment, between 5 September 2019 and 10 December 2019. Further details are available on the course page. Student editor(s): Iliao2345.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 04:23, 18 January 2022 (UTC)[reply]

Suggestions for the "Background" section

The first sentence mentions "attention mechanisms" without explaining what they are. Unfortunately, no article by that name exists, and a reader looking at the RNN, LSTM, and GRU pages will find no mention of them. I think this paragraph needs to be explicit about *which* specific models introduced attention mechanisms, with adequate citations. --Ninepoints (talk) 19:25, 21 July 2020 (UTC)[reply]

For what it's worth, there's this now:

Attention (machine learning)

– AndyFielding (talk) 11:14, 18 April 2024 (UTC)[reply]

Feedback from Logan Paterson on Isaac Liao's article

Logkailp (talk) 14:41, 22 October 2019 (UTC)

Praise:
- The article does a very good job of laying the groundwork for what Transformers are and detailing their inner workings.
- It doesn't repeat things too often.
- It links to other articles for applications of transformers instead of unnecessarily writing them out all over again.

Changes suggested:
- I would put a little more background information in the Background section. I came to the article knowing nothing about transformers or how RNNs or CNNs work, and so couldn't grasp the information as well as I could have with some background at the beginning.
- Consider separating the Training section from the Architecture section, as they are slightly different topics that could be more clearly distinguished from one another.
- Add a little more information to the section on CNNs.

Most important improvement:
- More background information, as noted above. This may just be a gap in my own background knowledge, but since the article is meant to be written for "everyone", you may want to add more to give the reader a grounding in the topic.

Applicable to mine:
- I really like the layout of the article and how it builds from background information, to how each individual part of a transformer functions, to the overall uses and applications of transformers.
- It transitions smoothly from topic to topic within each subsection.

Logkailp (talk) 14:41, 22 October 2019 (UTC) Logan Paterson[reply]

"Autoregressive" link points to wrong page

Someone linked the "Autoregressive" part of "Autoregressive Convolutional Neural Network" to "Autoencoder". Yes, they both start with "Auto", but this is clearly wrong. I'd fix it, but Wiki has rules these days where you can't fix a mistake unless you log in and then specify why you made a change, sign it, and have some understanding of how the "rules for editing" work? — Preceding unsigned comment added by 65.158.32.123 (talk) 14:05, 13 January 2020 (UTC)[reply]

I've made that change now, thanks. --aricooperdavis (talk) 22:14, 20 January 2020 (UTC)[reply]

Diagrams and simple explanations

Perhaps this is a stupid question, but what do people think of adding diagrams to the article? Also what do people think of adding dummies are us explanations? Daniel.Cardenas (talk) 18:32, 18 October 2020 (UTC)[reply]

Yes, diagrams are a good idea. However, one must ensure that they aren't misleading because then they do more harm than good. I don't know what "dummies are us explanations" means. ImTheIP (talk) 19:00, 18 October 2020 (UTC)[reply]

AlphaFold, transformers, and attention mechanisms

Given the recent "milestone scientific breakthrough" being hailed for AlphaFold's results on the protein structure prediction problem at CASP 14, and the use of transformers in computer vision ([1], [2]; also Image GPT), I think it would be useful to present what transformers do in a more general framing than their use in NLP.

(AlphaFold 2 is believed to use two transformer networks as the core of its design).

In AlphaFold#Algorithm I've written that the transformers

"effect a mathematical transformation of [the elements of two feature-vs-feature matrices].
These transformations have the effect of bringing relevant data together and filtering out irrelevant data for these two relationships, in a context-dependent way (the "attention mechanism"), that can itself be learnt from training data."

I'd be grateful for input as to whether I've got this more or less right?
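
As a rough illustration of the "bringing relevant data together and filtering out irrelevant data" description above, here is a minimal sketch of scaled dot-product attention in plain Python. The toy vectors and the function names `softmax` and `attend` are made up for this example; in an actual transformer the queries, keys, and values would come from learned linear projections of the input, which are omitted here.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a set of key/value pairs."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted average of the values: high-weight items dominate, low-weight ones fade.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights

# Toy data, invented purely for illustration.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]  # points along the first axis, so it matches keys 0 and 2

output, weights = attend(query, keys, values)
```

With this query, the second key (orthogonal to the query) receives a small weight, so its value contributes little to the output. The weighting depends on the query, which is the "context-dependent" part.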

Transformers therefore seem to be doing a similar job to bottleneck networks, autoencoders, latent variable extractors, and other forms of nonlinear input transformation and dimensional reduction techniques -- but there's obviously more to it than that. It might be useful to identify the similarities and differences.

(added): cf Transformers as Variational Autoencoders, found on github

Finally, it's clear that we could use an article on attention (machine learning), aka attention networks, aka attention mechanisms. Some of the following, found by Google, look like they may be relevant, but it would be good to get at least a stub created by someone who knows a bit about it.

Attention and Memory in Deep Learning
Lilian Weng, Attention? Attention!
Attention mechanism, FloydHub
Buomsoo Kim, Attention mechanism
Prodip Hore, Sayan Chatterjee A Comprehensive Guide to Attention Mechanism in Deep Learning for Everyone
also Giuliano Giacaglia, How Transformers Work, which puts attention etc in context.

Pinging @Iliao2345, Toiziz, The Anome, and ImTheIP: as recent editors here, in case you can help. Jheald (talk) 15:06, 2 December 2020 (UTC)[reply]

I agree with everything you say. Please incorporate this into the article. And yes, we should have an article on attention (machine learning), aka attention networks, aka attention mechanisms. I'll create a stub for it now. -- The Anome (talk) 09:11, 3 December 2020 (UTC)[reply]

Any idea on how to find reliable sources in this area? Most of my knowledge in the area comes from github, random blog posts, and YouTube and those sources don't count. Would ArXiv do? ImTheIP (talk) 09:25, 3 December 2020 (UTC)[reply]

@ImTheIP: Well, we're not under WP:MEDRS, or Israel/West Bank restrictions, so sourcing can be a little more permissive. Obviously, the usual hierarchy applies, with major textbooks, and reviews and survey articles and tour-d'horizon commentary pieces from the leading journals in the field near the top of the tree, and other sources falling somewhere below that. A key criterion is always: does the source have a reputation for knowing what they're talking about? (Also: how mainstream, or introductory, is what they're saying? They may get more latitude reviewing the foundations of the field, vs playing up their latest project.) My understanding is that ML is a field that very much talks to itself through preprints and conference papers, so arXiv papers should certainly have their place. I also think there is a place for more informal pieces like blogs or videos, which can give more accessible treatments that can be useful to readers. Videos from authoritative sources can certainly be worth adding as External links. With luck, most of this area shouldn't be controversial, so IMO it's a question of finding the balance of references that are most useful to readers. And of course, we're a wiki: so there's always a lot to be said for going with what we've got, establishing a framework or a structure for the topic, then incrementally finding what we can add to the topic. People can always retire old references and ELs, if they have sources that are better.

Incidentally, the paper from Google Research on transformers in computer vision that I linked above (An image is worth 16X16 words: transformers for image recognition at scale) looks very helpful, (and also the [3] tutorial based on it). One nice thing about vision examples is that they can be so visual -- I love the pictures showing the examples of attention.

I've also seen a reference to this paper as being of interest, in applying the transformer model to molecular-biological domains with 3d symmetries.

Nice quote too, from the start of that Google paper, on Transformers vs CNNs: "Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias."

-- if I'm reading that right, it's saying that with enough data, transformers can learn the symmetries and adjacencies of 1D, 2D, and 3D spaces, even when they have not been hard-coded in.

I don't want to be editing before I feel I've got a proper grasp and perspective of the subject, so I'd really appreciate if the shape of it could be laid down by those who do. But it does look very interesting! Jheald (talk) 16:28, 3 December 2020 (UTC)[reply]

The name "Transformer"

It would be great to have an explanation for the name "Transformer" included into the article, if there exists one, or a clarification that the name is arbitrary, otherwise. — Preceding unsigned comment added by AVM2019 (talk • contribs) 20:57, 5 December 2020 (UTC)[reply]

Vanilla Transformer Code: Incomplete

The "Pseudocode" section may be doing more to confuse than to help, because many of the terms are undefined (copy it into Python to see what I mean). So here is what I suggest:

Temporarily remove it.
Update the code to include the relevant imports from PyTorch or TensorFlow, or add custom definitions, so that all terms are well defined in the code.
Post it again. — Preceding unsigned comment added by 103.118.46.204 (talk • contribs)

This was PSEUDO Code not CODE. Why not just leave it. If one is able to program, one will find the right layers in pytorch or tensorflow.... — Preceding unsigned comment added by Nico Hambauer (talk • contribs)

@Nico Hambauer and 103.118.46.204: I would argue that it is not pseudo-code, but rather an incomplete Python implementation. Pseudocode, by its very definition, should not be as language-specific as this code snippet. Python operations such as "embedding()", "multi_head_attention", etc., should not appear in pseudocode; rather, the pseudocode should be readable by programmers in any language, whether or not they are familiar with the operation of these specific Python operations. WikiDan61^ChatMe!_ReadMe!! 13:53, 27 July 2021 (UTC)[reply]

I agree with WikiDan61. Stuff like "multi_head_attention(x, x, x, None)" is completely unreadable for those not already familiar with Python and the framework this is written in. intforce (talk) 14:06, 27 July 2021 (UTC)[reply]

OK, maybe I am wrong, and I have just used these frameworks so much that I no longer notice what is framework-specific. Still, I was kind of sad to see it go, as it was helpful even if one had to look up the implementations when not used to the libraries. Thanks for the note! Will revert my change then :) — Preceding unsigned comment added by Nico Hambauer (talk • contribs)

@Nico Hambauer: The pseudocode can be made useful if the functioning of the framework functions can be explained, rather than just assuming that the reader knows what they do. The best way to do this would be to include in the pseudocode a declaration of the function with a pseudocode description of its operation. Then the function can be invoked within the pseudocode, since the reader will now have the knowledge required to understand it. WikiDan61^ChatMe!_ReadMe!! 14:55, 27 July 2021 (UTC)[reply]

There is a readable XLNet publication ...

... at arxiv: https://arxiv.org/abs/1906.08237

Is there a compelling reason to cite it at OCLC, rather than in the place where people will be able to read it? 222.154.128.36 (talk) 09:14, 2 April 2022 (UTC)[reply]

Suggestion to increase the "Importance" to "Mid" or "High"

With recent progress within AI, transformers are entering more conversations with non-experts. Also, this topic is relevant to a growing number of fields outside of linguistics. Cscangarella (talk) 04:34, 10 April 2023 (UTC)Cscangarella[reply]

This is already oversimplified. It should never devolve even further into an article for people who can't even understand the current form. That would make it useless to the only people whom knowledge of Transformers could possibly serve. It needs to become more technical, not less. Someone lacking WP:COMPETENCE might be similarly offended by the articles on specific topics in Pure Mathematics. "Linguistics"... 76.188.120.7 (talk) 18:27, 12 April 2023 (UTC)[reply]

There is Simple Wikipedia... People who feel lost should start there. Transformers are an extremely technical topic, where understanding the manipulation of high-dimensional spaces is a prerequisite. 82.102.110.228 (talk) 05:31, 31 December 2024 (UTC)[reply]

NPOV history

Please don't be like Schmidhuber.

Especially nefarious is retroactively naming "linear Transformer" to the 1993 model without explaining it is a retroactive naming, or just quoting old passages where "attention" is used metaphorically as if it is a direct originator of attention mechanism.

I think the fast weight controller is not a hushed-up origin of modern Transformers, but rather an attempt to apply high-order neural networks, or pi-sigma networks (1991), to the problem of processing sequential data. It failed to gain traction and plain LSTM dominated until 2014 when seq2seq introduced attention mechanism to LSTM, and 2017 purified attention mechanism into the Transformer. pony in a strange land (talk) 01:06, 24 April 2023 (UTC)[reply]

Schmidhuber overclaims, but so does the LeHingio. As per NPOV WP:DUE, all viewpoints should be represented in accordance with their weight. Such "nefarious" names as linear transformers should then be attributed, e.g., to Schmidhuber's deep learning review (>20000 citations), or to other appropriate publications. 82.102.110.228 (talk) 07:51, 31 December 2024 (UTC)[reply]

Relies too much on primary refs?

I think the notice on relying too much on primary references is not correct. The article has nearly 90 references. The primary reference here would be the 2017 paper ("Attention Is All You Need") and possibly some work leading up to that paper. However, most papers are after that, by different authors. Those are academic references, but not primary to the transformer architecture. Bquast (talk) 15:12, 13 May 2023 (UTC)[reply]

I suggest removing the notice. Maybe an inline notice about needing more non-academic sources could be added lower down. Bquast (talk) 15:13, 13 May 2023 (UTC)[reply]

IMO the introduction and history sections should have primarily older literature. The training/architecture sections can benefit from newer literature, especially the training section. 82.102.110.228 (talk) 07:57, 31 December 2024 (UTC)[reply]

Did Jürgen Schmidhuber invent Transformers?

Conflicting edits have added/removed statements such as In 1992, the first kind of Transformer was published by Jürgen Schmidhuber under the name "fast weight controller."

Schmidhuber has been involved in multiple controversies over what he terms credit assignment^[1]. He holds a minority but not fringe view, regarding the proper attribution of ideas in the field of AI.^[2]^[3]^[4]

The paper "Attention is All You Need"^[5] by Vaswani et al describes the Transformer as follows: "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention."

The paper Learning to Control Fast-Weight Memories^[6] by Jürgen Schmidhuber describes the Fast Weight Controller as: "This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: the first net learns to produce context dependent weight changes for the second net whose weights may vary very quickly."

There is not an immediate resemblance between the two methods: Transformers are a sequence-to-sequence model using self-attention, and Fast-Weight Controllers sound more like a predecessor to Hypernetworks^[7] ("an approach of using one network...to generate the weights for another network") or Memory Networks ^[8].

But, in the years after the Transformer gained popularity, several modified and altered systems based on the Transformer were proposed. One such system was the Linear Transformer^[9] by Katharopoulos et al. which "[expresses] the self-attention as a linear dot-product of kernel feature maps... We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks."

The Linear Transformer is not the same as the Transformer, but in the paper Linear Transformers Are Secretly Fast Weight Programmers^[10] Schmidhuber proves that it is mathematically equivalent to the Fast-Weight Controller, apart from its normalization scheme.
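
To make the claimed equivalence concrete, here is a small sketch that computes unnormalized causal linear attention in two ways: as an explicit sum over past positions, and as a recurrent "fast weight" matrix updated by outer products. It deliberately leaves out normalization, which is where the paragraph above says the two differ. The function names and toy data are invented for this sketch, and the feature map elu(x) + 1 is an assumption made here for illustration, not something stated in this thread.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise (assumed for this sketch)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention(queries, keys, values):
    """Unnormalized causal linear attention, written as a sum over past positions."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for i in range(t + 1):  # only positions up to and including t
            w = dot(phi(q), phi(keys[i]))
            out = [o + w * v for o, v in zip(out, values[i])]
        outputs.append(out)
    return outputs

def fast_weight_memory(queries, keys, values):
    """Same computation as a recurrent update: W += outer(v, phi(k)), then y = W phi(q)."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # the "fast weights", rewritten at every step
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for r in range(d_v):
            for c in range(d_k):
                W[r][c] += v[r] * fk[c]
        fq = phi(q)
        outputs.append([dot(row, fq) for row in W])
    return outputs

# Toy sequence of three positions, invented for illustration.
queries = [[0.5, -1.0], [1.0, 2.0], [-0.3, 0.1]]
keys = [[1.0, 0.0], [-1.0, 0.5], [0.2, 0.2]]
values = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0], [2.0, 0.0, -2.0]]

a = linear_attention(queries, keys, values)
b = fast_weight_memory(queries, keys, values)
```

The two results agree up to floating-point rounding, because summing (phi(q)·phi(k_i)) v_i over past positions is the same as multiplying phi(q) by the accumulated sum of outer products v_i phi(k_i)^T.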

To cover Jürgen Schmidhuber's contributions without violating either WP:NPOV or WP:UNDUE, I propose that the article should make clear the following:

Schmidhuber invented the Fast Weight Controller
The FWC was mathematically almost identical to Katharopoulos' Linear Transformer, but not to Vaswani's Transformer
The FWC did not have the language-processing capabilities of a modern Transformer
The FWC is a notable historical contribution to the line of research that produced the Transformer (along with other forms of recurrent neural networks in the 80s, 90s, and 2000s.)

Lwneal (talk) 18:06, 12 August 2023 (UTC)[reply]

Agreed with Lwneal. +Schmidhuber not fringe, +Mathematical near-identity to linear transformer, +not in time for compute, +still an important historical contribution. 82.102.110.228 (talk) 08:01, 31 December 2024 (UTC)[reply]

"Decoder only" is ill defined

The description of the decoder block lists the original three sub-layers (a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network), but later, in the Terminology section, "decoder only" is defined as "autoregressive encoder, autoregressive decoder".

The words "decoder only" imply the lack of an encoder, yet nothing in the article addresses how this "autoregressive encoding" happens without an encoder, or what the decoder block looks like in that case: "an attention mechanism over the encodings" is confusing when no source of encodings is given. 101.178.0.181 (talk) 01:15, 4 September 2023 (UTC)[reply]
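
For readers with the same question: in decoder-only models such as the GPT family, the block keeps the masked self-attention and feed-forward sub-layers and drops the attention over encoder outputs, so there are no "encodings" to attend to. A minimal sketch of the masked (causal) self-attention step follows. The function name is hypothetical, and the learned query/key/value projections are replaced by the identity to keep the example short.

```python
import math

def causal_self_attention(x):
    """Self-attention where position t only sees positions 0..t (no encoder involved).

    Assumption for brevity: queries, keys and values are the inputs themselves,
    i.e. the learned projection matrices are taken to be the identity.
    """
    n, d = len(x), len(x[0])
    outputs, all_weights = [], []
    for t in range(n):
        # Scores against the current and earlier positions only.
        scores = [sum(a * b for a, b in zip(x[t], x[i])) / math.sqrt(d)
                  for i in range(t + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Future positions get weight exactly zero: this is the causal mask.
        weights = [e / total for e in exps] + [0.0] * (n - t - 1)
        outputs.append([sum(w * x[i][j] for i, w in enumerate(weights))
                        for j in range(d)])
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, all_weights = causal_self_attention(x)
```

The weight matrix is lower triangular, which is what lets the same stack both read the prompt and generate the continuation one token at a time.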

references 33 and 35 seem unhelpful

Why is there a need to cite a paper from the arXiv, published a year after the paper which made a scientific leap?
What is the point of the Ithaca example? Ladypine (talk) 09:02, 31 October 2023 (UTC)[reply]

Loss function

Seemingly not covered in the article: when creating a transformer, what is the loss function to be minimized? I see in the article that once trained, a transformer can be used with a post-processing layer (or layers) to be trained, which enable a specific task such as classification. I understand a loss function for the transformer-plus-classification task, but what is the loss function used on the raw transformer before a specific task is chosen to be appended?

Or putting it another way, I can't be the only person who is looking for mention of a loss function. I would very much appreciate a sentence along the lines of one of these:

The loss function is, in effect, ....
In lieu of a loss function, ....

Thanks — Quantling (talk | contribs) 20:44, 1 November 2023 (UTC)[reply]

You have to attach a task head later, and the task head uses a loss function suited to your task. Biggerj1 (talk) 21:58, 6 March 2024 (UTC)[reply]
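
As a concrete case of a "task head plus loss", language-model pretraining commonly puts a vocabulary-sized output layer on top of the transformer and minimizes the cross-entropy of predicting the next token (masked-token prediction is the analogous choice for BERT-style models). The sketch below shows only that loss. The function names, the three-word vocabulary, and the logits are invented for illustration, and the logits stand in for whatever the network would output.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability that softmax(logits) assigns to the target token."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from the output at position t."""
    losses = [cross_entropy(logits_per_position[t], token_ids[t + 1])
              for t in range(len(token_ids) - 1)]
    return sum(losses) / len(losses)

token_ids = [0, 2, 1]  # a toy sequence over a 3-word vocabulary

# An untrained model that is equally unsure about every token.
uniform_logits = [[0.0, 0.0, 0.0]] * 3
# A model that strongly favours the correct next token at each position.
confident_logits = [[0.0, 0.0, 10.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]

uniform_loss = next_token_loss(uniform_logits, token_ids)
confident_loss = next_token_loss(confident_logits, token_ids)
```

The uniform model scores log(3) per token, the loss of guessing among three equally likely words, and the confident model scores close to zero. The output at the last position is unused because there is no following token to predict.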

Wiki Education assignment: Research Process and Methodology - FA23 - Sect 202 - Thu

This article was the subject of a Wiki Education Foundation-supported course assignment, between 6 September 2023 and 14 December 2023. Further details are available on the course page. Student editor(s): HELLOEXTRACREDIT (article contribs).

— Assignment last updated by HELLOEXTRACREDIT (talk) 20:51, 11 November 2023 (UTC)[reply]

Wiki Education assignment: Linguistics in the Digital Age

This article was the subject of a Wiki Education Foundation-supported course assignment, between 21 August 2023 and 11 December 2023. Further details are available on the course page. Student editor(s): Gh0828 (article contribs).

— Assignment last updated by Fedfed2 (talk) 00:54, 9 December 2023 (UTC)[reply]

Transformers transform what?

I came to this article to learn what a "Transformer" is or does. After reading it twice, I still haven't determined much of anything about why it would be called a "transformer" or where it fits in an A.I. system. According to Wikipedia tradition, and probably the MOS, the answer should have been in the first few sentences. Instead, I have dug through a word salad of gobbledygook and have only faint impressions of the underlying technology involved but no clear, top-level understanding of what it does. —EncMstr (talk) 22:08, 29 March 2024 (UTC)[reply]

The name isn't of much importance to be honest. Researchers like naming things any which way. 80.2.247.44 (talk) 20:19, 6 July 2024 (UTC)[reply]

Full transformer architecture pseudocode

I find the pseudocode adapted from Phuong and Hutter very informative, but it is rather counterintuitive to have three parameters that are exactly the same in the multiheaded_attention(z_e, z_e, z_e) function. I see that this follows from the conceptual explanation in the previous sections on Encoder and Decoder, in which H is the combined matrix of input vectors corresponding to projection matrices (i.e., Q, K, V, that is, query, key, value weights, respectively). In the cited article, H is a parameter denoting the number of attention heads (for the multiheaded attention), and the inputs to the attention functions (Algorithms 4 & 5) are the primary and positional encodings of tokens, whereas the weights of the projection matrices are parameters. I am not sure it is accurate to use z_e to represent the weights of the projection matrices, though I understand that it makes it easier to highlight that, in the case of the decoder, the encoder-decoder attention uses the information from the encoder's token embeddings (i.e., multiheaded_attention(z_d, z_e, z_e); the positional embedding is already inherently present within the encoder object). However, it also creates some confusion about how exactly the attention works. I wonder whether it would make sense to include this information, something along the lines of multiheaded_attention(z_e, H, W), where W could be specified as "W_enc", "W_dec", or "W_encdec" and depends on z_e only in W_enc, z_d only in W_dec, and both z_e and z_d in W_encdec (as per Algorithm 8 in the article referred to). This would also apply to the masked_multiheaded_attention function. Emreg00 (talk) 14:18, 14 February 2025 (UTC)[reply]
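
One way to read the repeated argument, sketched below under stated assumptions: the three positional arguments name the sequences from which queries, keys, and values are computed, while the projection weights are separate parameters. The signature `attention(q_src, k_src, v_src, W_q, W_k, W_v)`, the single head, and the identity weight matrices are all hypothetical choices for this example and are not taken from the article's pseudocode or from Phuong and Hutter.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention(q_src, k_src, v_src, W_q, W_k, W_v):
    """Single-head attention; the three sources are inputs, the W_* matrices are parameters."""
    Q = [matvec(W_q, x) for x in q_src]
    K = [matvec(W_k, x) for x in k_src]
    V = [matvec(W_v, x) for x in v_src]
    d = len(Q[0])
    out = []
    for q in Q:  # one output row per query position
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]          # placeholder for learned weights
z_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # encoder states (3 tokens)
z_d = [[0.5, 0.5], [2.0, -1.0]]              # decoder states (2 tokens)

# Encoder self-attention: queries, keys and values all come from z_e.
self_attn = attention(z_e, z_e, z_e, identity, identity, identity)
# Encoder-decoder attention: queries from z_d, keys and values from z_e.
cross_attn = attention(z_d, z_e, z_e, identity, identity, identity)
```

Under this reading, passing z_e three times says only that all three roles draw on the same sequence, and the output has one row per query position: three rows for self-attention over z_e, two for cross-attention from z_d.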

Timeline too long

The timeline is currently 90% just some highlights of language modeling, 1990-2018. Also, it gives undue focus to Schmidhuber. Despite what he says, attention mechanism had been studied in vision models since late 1980s, see Attention (machine learning) § History for references.

A good rule of thumb is to assume Schmidhuber is wrong about history. pony in a strange land (talk) 09:03, 6 August 2024 (UTC)[reply]

GA Review

Unsuccessful. Phlsph7 (talk) 08:14, 12 August 2024 (UTC)[reply]

The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

This review is transcluded from Talk:Transformer (deep learning architecture)/GA1.

Nominator: Cosmia Nebula (talk · contribs) 19:04, 9 August 2024 (UTC)[reply]

Reviewer: Phlsph7 (talk · contribs) 08:14, 12 August 2024 (UTC)[reply]

Hello Cosmia Nebula and thanks for all your improvements to this article. However, despite the improvements, the article fails criterion 2b since there are too many unreferenced paragraphs. Examples are the paragraphs starting with "For many years, sequence modelling ", "As the Transformer architecture natively processes", and "A positional encoding is a fixed-size vector". According to criterion 2b, these passages require inline citations "no later than the end of the paragraph".

The article cites many papers from arXiv. They are usually considered self-published sources, making them unreliable, see WP:ARXIV. Maybe some of them are also published in reliable journals, in which case you could cite these versions instead. You would probably have to replace the rest with other sources.

I suggest that you add all the missing references and replace the arXiv papers before a renomination.

A few other observations

WP:EARWIG detects no copyvios
"Linear transformers were first developed as an improvement over previous architectures for machine translation, but has found many applications since then.": there is a problem with the clause starting with "but"; should it be "..., but many additional applications have been found for them since then"?
"An well-cited early example was": replace "An well-cited" with "A well-cited" or maybe with "An often-cited"
"One key innovation was use of an attention mechanism": add "the" before "use"
"by removing its recurrence to processes all tokens in parallel": should this be "to process" instead of "to processes"?

Phlsph7 (talk) 08:14, 12 August 2024 (UTC)[reply]

Requested move 19 June 2025

The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review after discussing it on the closer's talk page. No further edits should be made to this discussion.

The result of the move request was: not moved. (closed by non-admin page mover) Bensci54 (talk) 16:56, 26 June 2025 (UTC)[reply]

Transformer (deep learning architecture) → Transformer model – Current dab is needlessly wordy given that any of the alternatives with only a single adjunct already suffices to identify the topic of the article, as also shown by how many of them are currently WP:PRIMARYREDIRECTs to this article. Transformer architecture is also fairly common but a bit less concise, and some of the other alternatives like transformer network or transformer (machine learning) (or transformer (deep learning) for that matter) may also be plausible, but I see no reason for there to be two adjuncts attached to this title on any interpretation of our titling policy. Alpha3031 (t • c) 11:00, 19 June 2025 (UTC)[reply]

Oppose. "Transformer model" is not unambiguous enough. It could also refer to physical models representing the behavior of an electrical transformer. Moreover, the transformer in ML is not really a "model" by itself; rather, it is a high-level architecture that forms the basis of specific instantiations, such as BERT, GPT, T5, etc., and it doesn't directly account for the underlying task-specific objectives. 240B:C020:493:CE18:B5E8:CA08:16FE:1BEB (talk) 18:53, 21 June 2025 (UTC)[reply]
Oppose "Transformer model". Citing the Machine learning page: "A machine learning model is a type of mathematical model that, once "trained" on a given dataset, can be used to make predictions or classifications on new data." I believe the main topic of this page is the architecture. If the idea is to read the proposed name "Transformer model" as "architecture used in Transformer-based models", the new name seems unnecessarily implicit and confusing; in my opinion that is a big loss for the arguably negligible gain of a few characters fewer than "Transformer architecture". Among the proposed pairs of terms, the one that I believe pins down the concept most narrowly is Transformer (deep learning), but Transformer (machine learning) would also be a good option. Related title examples: Neural network (machine learning), Fine-tuning (deep learning) and Attention_(machine_learning). Ttmms (talk) 21:50, 21 June 2025 (UTC)[reply]
Ttmms, WP:NATURALDAB is typically preferred over WP:PARENDIS. Though, transformer network would probably be better for WP:CRITERIA #5 given all the other Category:Neural network architectures are "X network". Alpha3031 (t • c) 09:52, 22 June 2025 (UTC)[reply]
@Alpha3031: Which "all the other" are you referring to? I can think of Feedforward_neural_network, Recurrent_neural_network or Convolutional_neural_network, but those are more "network-like" than transformers, where there is also a matrix dot product (the attention part) interleaved with feed-forward networks. Technically speaking, one can also see the computation graph of the dot product as a network, but I'm not sure how common that is. My guess is that in the literature the exact string "transformer network" is used less often than "transformer architecture"; it would be nice to do some quick research on that. Also, if you know of other rename/disambiguation cases where WP:NATURALDAB was chosen, they would be great to compare. Ttmms (talk) 10:26, 22 June 2025 (UTC)[reply]
We usually reference policies and guidelines instead of individual cases Ttmms, because there are probably hundreds of thousands of them and it would be difficult to properly sample them. Also, I was trying to link Category:Neural network architectures but I missed a :, which causes it to categorise this page into the category instead of displaying it. That might be why you didn't see anything. I've fixed it now. Alpha3031 (t • c) 10:44, 22 June 2025 (UTC)[reply]
I did a very quick search on arXiv. In my interpretation it clearly puts "transformer architecture" above "transformer network"; the latter appears only in the form "(spatial|graph|image|X) transformer network", indicating a more specific or elaborate network built on the transformer (architecture). Ttmms (talk) 10:54, 22 June 2025 (UTC)[reply]
Anyway, I don't think I actually mind any one of the three (i.e., network|model|architecture), they're all fairly common (though I'd agree model and architecture are more so than network, which is why I put it third in my nom) so the trade off partially comes down to how people want to balance WP:CRITERIA. Discussion on that is why I opened this as a full RM instead of a TR. Alpha3031 (t • c) 15:42, 26 June 2025 (UTC)[reply]

I oppose “Transformer Network” because it makes me think of a network of electrical transformers. 24.19.113.134 (talk) 21:53, 22 June 2025 (UTC)[reply]

Oppose. "Transformer model" for the title of this article is highly ambiguous. I think in the title, we should mention "deep learning" to make it expressive. Hooman Mallahzadeh (talk) 15:54, 26 June 2025 (UTC)[reply]
What about the other two natdis alternatives (network|architecture), Hooman Mallahzadeh? Also, what is it actually ambiguous with in RS? From what I can tell this is the WP:PRIMARYTOPIC for all three by substantial margins. Even when trying to bring up electrical topics (e.g. "transformer network" electrical as a search term), you're more likely to find articles referring to T(A)NN than anything else. Alpha3031 (t • c) 16:16, 26 June 2025 (UTC)[reply]
@Alpha3031 Please see Transformer (disambiguation). Even in science, it has 6 applications. I think "Transformer model" for an ordinary person has no relevant meaning to "neural networks", and only an expert understands that.

Please test this phenomenon: Ask an ordinary (not expert) person: "What is a «transformer model»?" and please report answers. Thanks, Hooman Mallahzadeh (talk) 16:34, 26 June 2025 (UTC)[reply]
As mentioned in the article, "Transformer" is not a "model" or "network" but it is an "architecture". Hooman Mallahzadeh (talk) 16:44, 26 June 2025 (UTC)[reply]
Our policy requires that someone familiar with [...] the subject area be able to recognise the name or description, Hooman Mallahzadeh, not an ordinary person or an expert. This is why the title of gating mechanism doesn't explain that it doesn't mean a literal physical mechanism used in a gate. It is the job of the Wikipedia:Short description and first sentence of the article to tell people more about the article subject. Alpha3031 (t • c) 16:55, 26 June 2025 (UTC)[reply]
@Alpha3031 You are right! I agree with "Transformer architecture" as the title of this article, because it is not "model" or "network". Hooman Mallahzadeh (talk) 17:06, 26 June 2025 (UTC)[reply]

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

[1] https://people.idsia.ch/~juergen/deep-learning-history.html

[2] https://www.nytimes.com/2016/11/27/technology/artificial-intelligence-pioneer-jurgen-schmidhuber-overlooked.html

[3] https://www.youtube.com/watch?v=HGYYEUSm-0Q&t=3770s

[4] https://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html

[5] https://arxiv.org/pdf/1706.03762.pdf

[6] https://mediatum.ub.tum.de/doc/814768/document.pdf

[7] https://arxiv.org/pdf/1609.09106.pdf

[8] https://arxiv.org/pdf/1410.3916

[9] https://linear-transformers.com/

[10] https://arxiv.org/pdf/2102.11174.pdf
