Re: [News] OpenAI: Evidence in hand that DeepSeek misappropriated its models

treasurehill posted on 2025/1/30 3:29:35 PM

Board: HatePolitics
Title: Re: [News] OpenAI: Evidence in hand that DeepSeek misappropriated its models
Author: treasurehill (寶藏巖公社，你還未夠班S)
Time: Jan 30 15:29:35 2025
Votes: 2 (push: 3, boo: 1, →: 16)

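push skyyo: Both sides in this thread are quite professional; I didn't expect the green filter to have this much influence @@ 27.247.1.211 01/30 14:54
→ skyyo: Really, someone should ask treasurehill (寶藏巖) what he thinks of his Green Commie (綠共) friends' view that "can't answer questions about June 4th (64) + uses distillation = garbage AI", hehe 27.247.1.211 01/30 14:55
→ skyyo: Oh, and also the gap between chatDPP, the Green Commie camp's representative contender, and deepseek 27.247.1.211 01/30 14:57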

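LOL! Anyone who has studied even a little chaos theory knows that this kind of self-recursive learning on the same data will, past a certain critical point, lead to severe system collapse,

because it cannot preserve the diversity of the original data.

The end result is the same garbage AI-generated data circulating inside the system,

which ultimately produces a stream of garbage output.

Any computer science professor familiar with data mining lays out this basic concept at the very start of the first lecture.

Only AI laymen keep hyping knowledge distillation as some brilliant technique.

To people in the industry, it is just a joke.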

https://arxiv.org/abs/2305.17493v2

The Curse of Recursion: Training on Generated Data Makes Models Forget
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot,
Ross Anderson
Stable Diffusion revolutionised image creation from descriptive text. GPT-2,
GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of
language tasks. ChatGPT introduced such language models to the general
public. It is now clear that large language models (LLMs) are here to stay,
and will bring about drastic change in the whole ecosystem of online text and
images. In this paper we consider what the future might hold. What will
happen to GPT-{n} once LLMs contribute much of the language found online? We
find that use of model-generated content in training causes irreversible
defects in the resulting models, where tails of the original content
distribution disappear. We refer to this effect as Model Collapse and show
that it can occur in Variational Autoencoders, Gaussian Mixture Models and
LLMs. We build theoretical intuition behind the phenomenon and portray its
ubiquity amongst all learned generative models. We demonstrate that it has to
be taken seriously if we are to sustain the benefits of training from
large-scale data scraped from the web. Indeed, the value of data collected
about genuine human interactions with systems will be increasingly valuable
in the presence of content generated by LLMs in data crawled from the
Internet.
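
The effect the abstract describes (tails of the distribution disappearing when a model is refit to its own output) can be seen in a toy setting. The sketch below is only an illustration under strong assumptions: a single one-dimensional Gaussian, a small sample per generation, and a maximum-likelihood fit. The function names `fit_gaussian` and `recursive_training` and all parameter values are made up for this example; it is not the paper's experimental setup, and it says nothing about how any particular LLM or distillation pipeline is actually trained.

```python
import math
import random


def fit_gaussian(samples):
    """Maximum-likelihood fit of a 1-D Gaussian: returns (mean, std)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # MLE divides by n
    return mu, math.sqrt(var)


def recursive_training(n_samples=10, generations=300, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Generation 0 is fitted to "real" data drawn from N(0, 1). Every later
    generation sees only samples produced by the previous fitted model.
    Returns the list of fitted standard deviations, one per generation.
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # original data
    history = []
    for _ in range(generations):
        mu, sigma = fit_gaussian(data)
        history.append(sigma)
        # The next generation trains purely on model-generated samples.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    return history


if __name__ == "__main__":
    stds = recursive_training()
    for gen in (0, 50, 100, 200, 299):
        print(f"generation {gen:3d}: fitted std = {stds[gen]:.3e}")
```

In this toy, each refit loses a little variance on average, and because no fresh real data ever enters the loop the losses compound, so the fitted standard deviation drifts toward zero. Whether a real training pipeline behaves this way depends on details the toy leaves out, such as how much original data is mixed back in at each step.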
--


※ Origin: PTT (批踢踢實業坊, ptt.cc), From: 42.70.83.123 (Taiwan)


→ amordelcor: =.= You can tell just by looking at AI-related stock prices 01/30 15:32

boo Sinreigensou: Don't throw terms around carelessly; chaos theory has nothing to do with this situation 01/30 15:33

You clearly never learned chaos theory properly! You don't even know about dissipative structures.

https://reurl.cc/yD0WGy

push lono: This concept is not basic at all; nobody knows why it happens 01/30 15:36

→ aragorn747: 神靈 and a contrarian troll (槓精) showing up at the same time is pretty funny 01/30 15:38

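→ ferb: OpenAI was sued by US companies for misappropriation long ago 01/30 15:39
→ ferb: Isn't the lawyer going to talk about that? 01/30 15:39
→ ferb: And they even switched to a paid, for-profit model 01/30 15:40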

※ Edited: treasurehill (42.70.83.123 Taiwan), 01/30/2025 15:42:48
※ Edited: treasurehill (42.70.83.123 Taiwan), 01/30/2025 15:52:45
※ Edited: treasurehill (42.70.83.123 Taiwan), 01/30/2025 15:53:16

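→ William: Modern ML has something to do with chaos theory? That is quite a mix-up... 01/30 15:55
→ William: Also, DeepSeek's implementation is not what you imagine, repeatedly training on the same data.. 01/30 15:59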

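LOL! I am talking about the collapse of self-recursive systems; you jumped in to clown around without even understanding what you read.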

https://arxiv.org/pdf/2305.17493v2

The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

※ Edited: treasurehill (42.70.83.123 Taiwan), 01/30/2025 16:03:02
※ Edited: treasurehill (42.70.83.123 Taiwan), 01/30/2025 16:05:20
※ Edited: treasurehill (42.70.83.123 Taiwan), 01/30/2025 16:07:39

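→ William: DeepSeek R1's implementation is different from the one in your paper.. go read the DeepSeek paper first, then come back to discuss.. 01/30 16:13
→ William: Or, to put it simply, DeepSeek found an engineering approach that lets data distillation avoid this problem or reduce its impact. After all, DeepSeek was not the first to implement data distillation, but in engineering practice the parameters and 01/30 16:16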