Transformer (deep learning architecture) was nominated as an Engineering and technology good article, but it did not meet the good article criteria at the time (August 12, 2024, reviewed version). There are suggestions on the review page for improving the article. If you can improve it, please do; it may then be renominated.
This article is rated B-class on Wikipedia's content assessment scale.
Text and/or other creative content from this version of Large language model was copied or moved into Transformer (machine learning model) with this edit on 2 August 2023. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted as long as the latter page exists.
On 19 June 2025, it was proposed that this article be moved to Transformer model. The result of the discussion was not moved.
This article was the subject of a Wiki Education Foundation-supported course assignment, between 5 September 2019 and 10 December 2019. Further details are available on the course page. Student editor(s): Iliao2345.
Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 04:23, 18 January 2022 (UTC)
The first sentence mentions "attention mechanisms" without explaining what they are. Unfortunately, no article by that name exists, and a reader looking at the RNN, LSTM, and GRU pages will find no mention of them. I think this paragraph needs to be explicit about *which* specific models introduced attention mechanisms, with adequate citation. --Ninepoints (talk) 19:25, 21 July 2020 (UTC)
Logkailp (talk) 14:41, 22 October 2019 (UTC)
Praise:
- The article does a very good job of laying the groundwork for what Transformers are and giving details on their inner workings.
- It doesn't repeat things too often.
- It links to other articles for applications of transformers instead of unnecessarily writing them out all over again.
Changes suggested:
- I would put a little more background information in the background portion. I came into the article knowing nothing about transformers or the way that RNNs or CNNs work, and therefore couldn't grasp the information as well as I could have with some background at the beginning.
- You might want to separate the training section from the Architecture section, as they seem to be slightly different topics that could be more clearly distinguished from one another.
- Add a little more information in the section on CNNs.
Most important improvement:
- More background information, as described above. This may just be a problem with my background knowledge, but since the article is meant to be written for "everyone", you may want to add more to give the reader a groundwork of the topic.
Applicable to mine:
- I really like the layout of the article and how it builds from background information, to the workings of the topic and how each individual part of a transformer functions, to the overall uses and applications of transformers.
- It transitions smoothly from topic to topic within each subsection.
Logkailp (talk) 14:41, 22 October 2019 (UTC) Logan Paterson
Someone linked the "Autoregressive" part of "Autoregressive Convolutional Neural Network" to "Autoencoder". Yes, they both start with "Auto", but this is clearly wrong. I'd fix it, but Wiki has rules these days where you can't fix a mistake unless you log in and then specify why you made a change, sign it, and have some understanding of how the "rules for editing" work? — Preceding unsigned comment added by 65.158.32.123 (talk) 14:05, 13 January 2020 (UTC)
I've made that change now, thanks. --aricooperdavis (talk) 22:14, 20 January 2020 (UTC)
Perhaps this is a stupid question, but what do people think of adding diagrams to the article? Also what do people think of adding dummies are us explanations? Daniel.Cardenas (talk) 18:32, 18 October 2020 (UTC)
Given the recent "milestone scientific breakthrough" being hailed for AlphaFold for its results in the protein structure prediction problem at CASP 14, and also their use in computer vision ([1], [2]; also Image GPT), I think it would be useful if we could try to present what they are trying to do in a more general framing perspective, wider and more general than their use in NLP.
(AlphaFold 2 is believed to use two transformer networks as the key core of its design).
In AlphaFold#Algorithm I've written that the transformers
"effect a mathematical transformation of [the elements of two feature-vs-feature matrices].
These transformations have the effect of bringing relevant data together and filtering out irrelevant data for these two relationships, in a context-dependent way (the "attention mechanism"), that can itself be learnt from training data."
I'd be grateful for input as to whether I've got this more or less right?
Transformers therefore seem to be maybe doing a similar job to bottleneck networks, autoencoders, latent variable extractors, and other forms of nonlinear input transformation and dimensional reduction techniques -- but there's obviously more to it than that. It might be useful to identify the similarities and differences.
Finally, it's clear that we could use an article on attention (machine learning), aka attention networks, aka attention mechanisms. Some of the following, found by Google, look like they may be relevant, but it would be good to get at least a stub created by someone who knows a bit about it.
Pinging @Iliao2345, Toiziz, The Anome, and ImTheIP: as recent editors here, in case you can help. Jheald (talk) 15:06, 2 December 2020 (UTC)
It would be great to have an explanation for the name "Transformer" included into the article, if there exists one, or a clarification that the name is arbitrary, otherwise. — Preceding unsigned comment added by AVM2019 (talk • contribs) 20:57, 5 December 2020 (UTC)
The "Pseudocode" section may be doing more to confuse than help because many of the terms are undefined (copy it to Python to see what I mean). So here is what I suggest:
This was PSEUDO Code not CODE. Why not just leave it. If one is able to program, one will find the right layers in pytorch or tensorflow.... — Preceding unsigned comment added by Nico Hambauer (talk • contribs)
Ok maybe then I am wrong and the one that just stupidly used all the frameworks and now is used to it without noting differences anymore, but I was kind of sad to see it go as it was kind of helpful even if one had to look up all the implementations if not used to the libs. Thanks for the note! Will revert my change then :) — Preceding unsigned comment added by Nico Hambauer (talk • contribs)
... at arxiv: https://arxiv.org/abs/1906.08237
Is there compelling reason to cite it at OCLC, rather than in the place where people will be able to read it? 222.154.128.36 (talk) 09:14, 2 April 2022 (UTC)
With recent progress within AI, transformers are entering more conversations with non-experts. Also, this topic is relevant to a growing number of fields outside of linguistics. Cscangarella (talk) 04:34, 10 April 2023 (UTC)Cscangarella
Please don't be like Schmidhuber.
Especially nefarious is retroactively applying the name "linear Transformer" to the 1993 model without explaining that the naming is retroactive, or quoting old passages where "attention" is used metaphorically as if they were direct originators of the attention mechanism.
I think the fast weight controller is not a hushed-up origin of modern Transformers, but rather an attempt to apply high-order neural networks, or pi-sigma networks (1991), to the problem of processing sequential data. It failed to gain traction and plain LSTM dominated until 2014 when seq2seq introduced attention mechanism to LSTM, and 2017 purified attention mechanism into the Transformer. pony in a strange land (talk) 01:06, 24 April 2023 (UTC)
I think the notice on relying too much on primary references is not correct. The article has nearly 90 references. The primary reference here would be the 2017 paper ("Attention Is All You Need") and possibly some work leading up to that paper. However, most of the cited papers came after that, by different authors. Those are academic references, but not primary to the transformer architecture. Bquast (talk) 15:12, 13 May 2023 (UTC)
Conflicting edits have added/removed statements such as: "In 1992, the first kind of Transformer was published by Jürgen Schmidhuber under the name 'fast weight controller.'"
Schmidhuber has been involved in multiple controversies over what he terms credit assignment[1]. He holds a minority but not fringe view, regarding the proper attribution of ideas in the field of AI.[2][3][4]
The paper "Attention is All You Need"[5] by Vaswani et al describes the Transformer as follows: "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention."
The paper Learning to Control Fast-Weight Memories[6] by Jürgen Schmidhuber describes the Fast Weight Controller as: "This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: the first net learns to produce context dependent weight changes for the second net whose weights may vary very quickly."
There is not an immediate resemblance between the two methods: Transformers are a sequence-to-sequence model using self-attention, and Fast-Weight Controllers sound more like a predecessor to Hypernetworks[7] ("an approach of using one network...to generate the weights for another network") or Memory Networks [8].
But, in the years after the Transformer gained popularity, several modified and altered systems based on the Transformer were proposed. One such system was the Linear Transformer[9] by Katharopoulos et al. which "[expresses] the self-attention as a linear dot-product of kernel feature maps... We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks."
The Linear Transformer is not the same as the Transformer, but in the paper Linear Transformers Are Secretly Fast Weight Programmers[10] Schmidhuber proves that it is mathematically equivalent to the Fast-Weight Controller, apart from its normalization scheme.
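The equivalence mentioned above can be illustrated with a small sketch. It is not taken from either paper: it assumes causal linear attention with the normalization term dropped, and the feature map `phi` (elu(x) + 1) and both function names are illustrative choices. Under those assumptions, summing over past keys and values gives the same outputs as maintaining a "fast weight" matrix that is updated by an outer product at each step.

```python
import math


def phi(x):
    # Assumed positive feature map: elementwise elu(x) + 1.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_form(queries, keys, values):
    # out_t = sum over s <= t of (phi(q_t) . phi(k_s)) * v_s   (no normalization)
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            weight = dot(fq, phi(keys[s]))
            out = [o + weight * v for o, v in zip(out, values[s])]
        outputs.append(out)
    return outputs


def fast_weight_form(queries, keys, values):
    # W_t = W_{t-1} + outer(v_t, phi(k_t));  out_t = W_t applied to phi(q_t)
    d_k, d_v = len(keys[0]), len(values[0])
    fast_weights = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d_v):
            for j in range(d_k):
                fast_weights[i][j] += v[i] * fk[j]
        fq = phi(q)
        outputs.append([dot(row, fq) for row in fast_weights])
    return outputs
```

The second form needs only a fixed-size matrix per step rather than the full history, which is the "iterative implementation" the Katharopoulos et al. quote refers to. Softmax attention, as used in the original Transformer, does not reduce to this recurrence.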
To cover Jürgen Schmidhuber's contributions without violating either WP:NPOV or WP:UNDUE, I propose that the article should make clear the following:
Lwneal (talk) 18:06, 12 August 2023 (UTC)
References
Description of decoder block lists the original three sub-layers (a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network) but later in the Terminology section "decoder only" is defined as autoregressive encoder, autoregressive decoder.
The words "decoder only" imply the lack of an encoder, yet nothing in the article addresses how this "autoregressive encoding" happens without an encoder, or what the decoder block looks like in that case: "an attention mechanism over the encodings" is confusing when the source of the encodings is not given. 101.178.0.181 (talk) 01:15, 4 September 2023 (UTC)
Seemingly not covered in the article: when creating a transformer, what is the loss function to be minimized? I see in the article that once trained, a transformer can be used with a post-processing layer (or layers) to be trained, which enable a specific task such as classification. I understand a loss function for the transformer-plus-classification task, but what is the loss function used on the raw transformer before a specific task is chosen to be appended?
Or putting it another way, I can't be the only person who is looking for mention of a loss function. I would very much appreciate a sentence along the lines of one of these:
Thanks —Quantling (talk | contribs) 20:44, 1 November 2023 (UTC)
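As a hedged illustration of one common answer (not a claim about what the article says or should say): decoder-only language models are typically pretrained by minimizing the cross-entropy of predicting each next token, while encoder models such as BERT instead predict masked-out tokens. The sketch below shows only the next-token case. The names `next_token_loss` and `logits_per_position` are hypothetical, and the logits are assumed to come from some model that is not shown.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over one score vector.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] is assumed to be the model's score vector over
    the vocabulary after reading token_ids[0..t].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    total = 0.0
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        total -= log_softmax(logits_per_position[t])[target]
    return total / (len(token_ids) - 1)
```

With uniform scores the loss equals the logarithm of the vocabulary size, and it approaches zero as the model puts all probability on the correct next token.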
This article was the subject of a Wiki Education Foundation-supported course assignment, between 6 September 2023 and 14 December 2023. Further details are available on the course page. Student editor(s): HELLOEXTRACREDIT (article contribs).
— Assignment last updated by HELLOEXTRACREDIT (talk) 20:51, 11 November 2023 (UTC)
This article was the subject of a Wiki Education Foundation-supported course assignment, between 21 August 2023 and 11 December 2023. Further details are available on the course page. Student editor(s): Gh0828 (article contribs).
— Assignment last updated by Fedfed2 (talk) 00:54, 9 December 2023 (UTC)
I came to this article to learn what a "Transformer" is or does. After reading it twice, I still haven't determined much of anything about why it would be called a "transformer" or what place in an A.I. system it fits. According to Wikipedia tradition, and probably the MOS, the answer should have been in the first few sentences. Instead, I have dug through a word salad of gobbledygook and have only faint impressions of the underlying technology involved but no clear, top-level understanding of what it does. —EncMstr (talk) 22:08, 29 March 2024 (UTC)
I find the pseudocode adapted from Phuong and Hutter very informative, but it is rather counterintuitive to have three parameters that are exactly the same in the multiheaded_attention(z_e, z_e, z_e) function. I see that this follows from the conceptual explanation in the previous sections on Encoder and Decoder, in which H is the combined matrix of input vectors corresponding to the projection matrices (i.e., Q, K, V: the query, key, and value weights, respectively). In the cited article, H is a parameter denoting the number of attention heads (for the multiheaded attention), and the inputs to the attention functions (Algorithms 4 & 5) are the primary and positional encodings of tokens, whereas the weights of the projection matrices are parameters. I am not sure it is accurate to use z_e to represent the weights of the projection matrices, though I understand that it makes it easier to highlight that, in the case of the decoder, the encoder-decoder attention uses the information from the encoder's token embeddings (i.e., multiheaded_attention(z_d, z_e, z_e); the positional embedding is already inherently present within the encoder object). However, it also creates some confusion about how exactly the attention works. I wonder whether it would make sense to include this information, something along the lines of multiheaded_attention(z_e, H, W), where W could be specified as "W_enc", "W_dec", or "W_encdec" and depends on z_e only in W_enc, z_d only in W_dec, and both z_e & z_d in W_encdec (as per Algorithm 8 in the article referred to). This would also apply to the masked_multiheaded_attention function. Emreg00 (talk) 14:18, 14 February 2025 (UTC)
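One way to read the three-argument call, offered as a sketch rather than as the cited paper's algorithm: the three arguments are the token representations from which queries, keys, and values are computed, and the projection weights are separate parameters. The signature below is hypothetical (the `heads` and `w_out` parameters do not appear in the article's pseudocode), and positional encodings are assumed to be already added to the inputs.

```python
import math


def matmul(a, b):
    # Plain-Python matrix product of a (n x d) and b (d x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multiheaded_attention(x_query, x_key, x_value, heads, w_out):
    """heads is a list of (W_q, W_k, W_v) weight triples, one per head (assumed layout)."""
    per_head = []
    for w_q, w_k, w_v in heads:
        q = matmul(x_query, w_q)   # queries come from the first argument
        k = matmul(x_key, w_k)     # keys come from the second argument
        v = matmul(x_value, w_v)   # values come from the third argument
        scale = math.sqrt(len(q[0]))
        scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
        weights = [softmax(r) for r in scores]
        per_head.append(matmul(weights, v))
    # Concatenate the head outputs feature-wise, then apply the output projection.
    concat = [sum((h[i] for h in per_head), []) for i in range(len(x_query))]
    return matmul(concat, w_out)
```

With this reading, encoder self-attention is `multiheaded_attention(z_e, z_e, z_e, heads, w_out)` and encoder-decoder attention is `multiheaded_attention(z_d, z_e, z_e, heads, w_out)`: the output has one row per query token, and only the source of the keys and values changes.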
The timeline is currently 90% just some highlights of language modeling, 1990-2018. Also, it gives undue focus to Schmidhuber. Despite what he says, attention mechanism had been studied in vision models since late 1980s, see Attention (machine learning) § History for references.
A good rule of thumb is to assume Schmidhuber is wrong about history. pony in a strange land (talk) 09:03, 6 August 2024 (UTC)
The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.
Nominator: Cosmia Nebula (talk · contribs) 19:04, 9 August 2024 (UTC)
Reviewer: Phlsph7 (talk · contribs) 08:14, 12 August 2024 (UTC)
Hello Cosmia Nebula and thanks for all your improvements to this article. However, despite the improvements, the article fails criterion 2b since there are too many unreferenced paragraphs. Examples are the paragraphs starting with "For many years, sequence modelling ", "As the Transformer architecture natively processes", and "A positional encoding is a fixed-size vector". According to criterion 2b, these passages require inline citations "no later than the end of the paragraph".
The article cites many papers from arXiv. They are usually considered self-published sources, making them unreliable, see WP:ARXIV. Maybe some of them are also published in reliable journals, in which case you could cite these versions instead. You would probably have to replace the rest with other sources.
I suggest that you add all the missing references and replace the arXiv papers before a renomination.
A few other observations
Phlsph7 (talk) 08:14, 12 August 2024 (UTC)
The result of the move request was: not moved. (closed by non-admin page mover) Bensci54 (talk) 16:56, 26 June 2025 (UTC)
Transformer (deep learning architecture) → Transformer model – Current dab is needlessly wordy given that any of the alternatives with only a single adjunct already suffices to identify the topic of the article, as also shown by how many of them are currently WP:PRIMARYREDIRECTs to this article. Transformer architecture is also fairly common but a bit less concise, and some of the other alternatives like transformer network or transformer (machine learning) (or transformer (deep learning) for that matter) may also be plausible, but I see no reason for there to be two adjuncts attached to this title on any interpretation of our titling policy. Alpha3031 (t • c) 11:00, 19 June 2025 (UTC)
"transformer network" electricalas search term) you're more likely articles referring to T(A)NN than anything else. Alpha3031 (t • c) 16:16, 26 June 2025 (UTC)
"someone familiar with [...] the subject area" to recognise the name or description, Hooman Mallahzadeh, not "an ordinary person" or an "expert". This is why the title of gating mechanism doesn't explain that it doesn't mean a literal physical mechanism used in a gate. It is the job of the Wikipedia:Short description and first sentence of the article to tell people more about the article subject. Alpha3031 (t • c) 16:55, 26 June 2025 (UTC)