ΠΡΠ΅Π΄ΡΡΠ°Π²Π»Π΅Π½ΠΈΠ΅ ΠΌΠΎΠ΄Π΅Π»ΠΈ ΠΈ ΡΡΠ½ΠΊΡΠΈΡ ΡΡΠΎΠΈΠΌΠΎΡΡΠΈ ΠΠ°ΡΠΈΠ½Π½ΠΎΠ΅ ΠΎΠ±ΡΡΠ΅Π½ΠΈΠ΅
ΠΠ°ΡΠ° ΠΏΡΠ±Π»ΠΈΠΊΠ°ΡΠΈΠΈ Nov 13, 2019
ΠΡΠ°ΠΊ, Ρ Π½Π°ΡΠ°Π» Π΄Π΅Π»Π°ΡΡ ΠΏΠΎΠΏΡΠ»ΡΡΠ½ΡΠΉ ΠΊΡΡΡ ΠΌΠ°ΡΠΈΠ½Π½ΠΎΠ³ΠΎ ΠΎΠ±ΡΡΠ΅Π½ΠΈΡ ΠΎΡ Coursera. Π ΠΏΠΎΠ΄ΡΠΌΠ°Π», Π° ΠΏΠΎΡΠ΅ΠΌΡ Π±Ρ Π½Π΅ ΠΏΠΎΠ΄Π΅Π»ΠΈΡΡΡΡ ΡΠ΅ΠΌ, ΡΡΠΎ Ρ ΠΈΠ·ΡΡΠ°Ρ? ΠΡΠ°ΠΊ, ΡΠ΅Π³ΠΎΠ΄Π½Ρ Ρ ΡΠΎΠ±ΠΈΡΠ°ΡΡΡ ΠΏΠΎΠ³ΠΎΠ²ΠΎΡΠΈΡΡ ΠΎ Π΄Π²ΡΡ ΡΡΠ½Π΄Π°ΠΌΠ΅Π½ΡΠ°Π»ΡΠ½ΡΡ ΡΠ΅ΠΌΠ°Ρ ΠΌΠ°ΡΠΈΠ½Π½ΠΎΠ³ΠΎ ΠΎΠ±ΡΡΠ΅Π½ΠΈΡ.
ΠΡΠ΅Π΄ΡΡΠ°Π²Π»Π΅Π½ΠΈΠ΅ ΠΌΠΎΠ΄Π΅Π»ΠΈ:
ΠΠΈΠΏΠΎΡΠ΅Π·Π° ΠΎΠ±ΡΡΠ½ΠΎ ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»Π΅Π½Π°
ΠΠ΄Π΅ΡΡ theta1 ΠΈ theta2 ΡΠ²Π»ΡΡΡΡΡ ΠΏΠ°ΡΠ°ΠΌΠ΅ΡΡΠ°ΠΌΠΈ.
ΠΠ°Π²Π°ΠΉΡΠ΅ ΠΏΡΠ΅Π΄ΡΡΠ°Π²ΠΈΠΌ Π½Π°ΡΡ Π³ΠΈΠΏΠΎΡΠ΅Π·Ρ Π½Π° ΠΏΡΠΈΠΌΠ΅ΡΠ΅ ΡΠΎ ΡΠ»ΡΡΠ°ΠΉΠ½ΡΠΌΠΈ Π·Π½Π°ΡΠ΅Π½ΠΈΡΠΌΠΈ Π΄Π»Ρ theta1, theta2 ΠΈ x:
Π΄Π°Π²Π°ΠΉΡΠ΅ ΠΏΠΎΡΠΌΠΎΡΡΠΈΠΌ Π½Π° Π΄ΡΡΠ³ΠΎΠΉ ΠΏΡΠΈΠΌΠ΅Ρ:
Π€ΡΠ½ΠΊΡΠΈΡ ΡΡΠΎΠΈΠΌΠΎΡΡΠΈ:
ΠΡΠΈΠΌΠ΅ΡΠ°Π½ΠΈΠ΅: ΡΡΠ΅Π΄Π½Π΅Π΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΡΠΌΠ΅Π½ΡΡΠ΅Π½ΠΎ Π²Π΄Π²ΠΎΠ΅ (1/2) Π΄Π»Ρ ΡΠ΄ΠΎΠ±ΡΡΠ²Π° Π²ΡΡΠΈΡΠ»Π΅Π½ΠΈΡ Π³ΡΠ°Π΄ΠΈΠ΅Π½ΡΠ½ΠΎΠ³ΠΎ ΡΠΏΡΡΠΊΠ°, ΠΏΠΎΡΠΊΠΎΠ»ΡΠΊΡ ΠΏΡΠΎΠΈΠ·Π²ΠΎΠ΄Π½ΡΠΉ ΡΠ»Π΅Π½ ΡΡΠ½ΠΊΡΠΈΠΈ ΠΊΠ²Π°Π΄ΡΠ°ΡΠ° ΠΎΡΠΌΠ΅Π½ΠΈΡ ΡΠ»Π΅Π½ (1/2).
ΠΡ Ρ ΠΎΡΠΈΠΌ ΠΏΠΎΠ»ΡΡΠΈΡΡ Π»ΡΡΡΡΡ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΡΡ Π»ΠΈΠ½ΠΈΡ. ΠΠΎΠ³Π΄Π° ΡΡΠ΅Π΄Π½Π΅ΠΊΠ²Π°Π΄ΡΠ°ΡΠΈΡΠ΅ΡΠΊΠΈΠ΅ Π²Π΅ΡΡΠΈΠΊΠ°Π»ΡΠ½ΡΠ΅ ΡΠ°ΡΡΡΠΎΡΠ½ΠΈΡ ΡΠ°ΡΡΠ΅ΡΠ½Π½ΡΡ ΡΠΎΡΠ΅ΠΊ ΠΎΡ Π»ΠΈΠ½ΠΈΠΈ Π±ΡΠ΄ΡΡ ΠΌΠΈΠ½ΠΈΠΌΠ°Π»ΡΠ½ΡΠΌΠΈ, ΡΠΎΠ³Π΄Π° ΠΌΡ ΠΏΠΎΠ»ΡΡΠΈΠΌ Π½Π°ΠΈΠ»ΡΡΡΡΡ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΡΡ Π»ΠΈΠ½ΠΈΡ. ΠΠ°Π²Π°ΠΉΡΠ΅ ΠΏΠΎΡΠΌΠΎΡΡΠΈΠΌ ΠΏΡΠΈΠΌΠ΅ΡΡ, Π³Π΄Π΅ Π½Π°ΡΠ° ΡΡΠ½ΠΊΡΠΈΡ ΡΡΠΎΠΈΠΌΠΎΡΡΠΈ Π±ΡΠ΄Π΅Ρ 0, ΠΈ Ρ Π½Π°Ρ Π±ΡΠ΄Π΅Ρ Π»ΡΡΡΠ°Ρ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½Π°Ρ ΡΡΡΠΎΠΊΠ°:
ΠΠΎΠ³Π΄Π° theta1 = 1, ΠΌΡ ΠΏΠΎΠ»ΡΡΠ°Π΅ΠΌ Π½Π°ΠΊΠ»ΠΎΠ½ 1, Π° Π½Π°ΡΠ° ΡΡΠ½ΠΊΡΠΈΡ ΡΡΠΎΠΈΠΌΠΎΡΡΠΈ ΡΠ°Π²Π½Π° 0.
Π’Π΅ΠΏΠ΅ΡΡ Π΄Π°Π²Π°ΠΉΡΠ΅ ΠΏΡΠ΅Π΄ΠΏΠΎΠ»ΠΎΠΆΠΈΠΌ, ΡΡΠΎ theta1 = 0.5
ΠΡ ΠΌΠΎΠΆΠ΅ΠΌ Π²ΠΈΠ΄Π΅ΡΡ, ΡΡΠΎ ΡΡΠΎ ΡΠ²Π΅Π»ΠΈΡΠΈΠ²Π°Π΅Ρ Π½Π°ΡΡ ΡΡΠ½ΠΊΡΠΈΡ ΡΡΠΎΠΈΠΌΠΎΡΡΠΈ Π΄ΠΎ 0,5833
Π’Π΅ΠΏΠ΅ΡΡ ΠΌΡ Π±ΡΠ΄Π΅ΠΌ ΡΡΡΠΎΠΈΡΡ Π±ΠΎΠ»ΡΡΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠΉ J (theta1):
ΠΎΡΠΊΠ°Π·Π― ΠΏΡΠΎΡΡΠΎ Π΄Π΅Π»ΡΡΡ ΡΠ΅ΠΌ, ΡΡΠΎ ΡΠ·Π½Π°Π» ΠΈΠ· ΠΈΠ·Π²Π΅ΡΡΠ½ΠΎΠ³ΠΎ ΠΊΡΡΡΠ° ΠΠ½Π΄ΡΡ ΠΠ³ ΠΏΠΎ ΠΌΠ°ΡΠΈΠ½Π½ΠΎΠΌΡ ΠΎΠ±ΡΡΠ΅Π½ΠΈΡ. Π ΡΠ»Π΅Π΄ΡΡΡΠ΅ΠΉ ΡΡΠ°ΡΡΠ΅ Ρ Π±ΡΠ΄Ρ Π³ΠΎΠ²ΠΎΡΠΈΡΡ ΠΎ Π³ΡΠ°Π΄ΠΈΠ΅Π½ΡΠ½ΠΎΠΌ ΡΠΏΡΡΠΊΠ΅.
ΠΡΠΎΡΡ ΠΏΡΠΎΡΠ΅Π½ΠΈΡ Π·Π° ΠΌΠΎΠΉ Ρ ΡΠ΄ΡΠΈΠΉ (!) ΠΠΎΡΠ΅ΡΠΊ.
Π€ΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ (Loss Function)
Π€ΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ (Loss Function, Cost Function, Error Function; J) β ΡΡΠ°Π³ΠΌΠ΅Π½Ρ ΠΏΡΠΎΠ³ΡΠ°ΠΌΠΌΠ½ΠΎΠ³ΠΎ ΠΊΠΎΠ΄Π°, ΠΊΠΎΡΠΎΡΡΠΉ ΠΈΡΠΏΠΎΠ»ΡΠ·ΡΠ΅ΡΡΡ Π΄Π»Ρ ΠΎΠΏΡΠΈΠΌΠΈΠ·Π°ΡΠΈΠΈ ΠΠ»Π³ΠΎΡΠΈΡΠΌΠ° (Algorithm) ΠΠ°ΡΠΈΠ½Π½ΠΎΠ³ΠΎ ΠΎΠ±ΡΡΠ΅Π½ΠΈΡ (ML). ΠΠ½Π°ΡΠ΅Π½ΠΈΠ΅, Π²ΡΡΠΈΡΠ»Π΅Π½Π½ΠΎΠ΅ ΡΠ°ΠΊΠΎΠΉ ΡΡΠ½ΠΊΡΠΈΠ΅ΠΉ, Π½Π°Π·ΡΠ²Π°Π΅ΡΡΡ Β«ΠΏΠΎΡΠ΅ΡΠ΅ΠΉΒ».
Π€ΡΠ½ΠΊΡΠΈΡ (Function) ΠΏΠΎΡΠ΅ΡΡ ΠΌΠΎΠΆΠ΅Ρ Π΄Π°ΡΡ Π±ΠΎΜΠ»ΡΡΡΡ ΠΏΡΠ°ΠΊΡΠΈΡΠ΅ΡΠΊΡΡ Π³ΠΈΠ±ΠΊΠΎΡΡΡ Π²Π°ΡΠΈΠΌ ΠΠ΅ΠΉΡΠΎΠ½Π½ΡΠΌ ΡΠ΅ΡΡΠΌ (Neural Network) ΠΈ Π±ΡΠ΄Π΅Ρ ΠΎΠΏΡΠ΅Π΄Π΅Π»ΡΡΡ, ΠΊΠ°ΠΊ ΠΈΠΌΠ΅Π½Π½ΠΎ Π²ΡΡ ΠΎΠ΄Π½ΡΠ΅ Π΄Π°Π½Π½ΡΠ΅ ΡΠ²ΡΠ·Π°Π½Ρ Ρ ΠΈΡΡ ΠΎΠ΄Π½ΡΠΌΠΈ.
ΠΠ΅ΠΉΡΠΎΠ½Π½ΡΠ΅ ΡΠ΅ΡΠΈ ΠΌΠΎΠ³ΡΡ Π²ΡΠΏΠΎΠ»Π½ΡΡΡ Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΎ Π·Π°Π΄Π°Ρ: ΠΎΡ ΠΏΡΠΎΠ³Π½ΠΎΠ·ΠΈΡΠΎΠ²Π°Π½ΠΈΡ Π½Π΅ΠΏΡΠ΅ΡΡΠ²Π½ΡΡ Π·Π½Π°ΡΠ΅Π½ΠΈΠΉ, ΡΠ°ΠΊΠΈΡ ΠΊΠ°ΠΊ Π΅ΠΆΠ΅ΠΌΠ΅ΡΡΡΠ½ΡΠ΅ ΡΠ°ΡΡ ΠΎΠ΄Ρ, Π΄ΠΎ ΠΠΈΠ½Π°ΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ (Binary Classification) Π½Π° ΠΊΠΎΡΠ΅ΠΊ ΠΈ ΡΠΎΠ±Π°ΠΊ. ΠΠ»Ρ ΠΊΠ°ΠΆΠ΄ΠΎΠΉ ΠΎΡΠ΄Π΅Π»ΡΠ½ΠΎΠΉ Π·Π°Π΄Π°ΡΠΈ ΠΏΠΎΡΡΠ΅Π±ΡΡΡΡΡ ΡΠ°Π·Π½ΡΠ΅ ΡΠΈΠΏΡ ΡΡΠ½ΠΊΡΠΈΠΉ, ΠΏΠΎΡΠΊΠΎΠ»ΡΠΊΡ Π²ΡΡ ΠΎΠ΄Π½ΠΎΠΉ ΡΠΎΡΠΌΠ°Ρ ΠΈΠ½Π΄ΠΈΠ²ΠΈΠ΄ΡΠ°Π»Π΅Π½.
Π‘ ΠΎΡΠ΅Π½Ρ ΡΠΏΡΠΎΡΠ΅Π½Π½ΠΎΠΉ ΡΠΎΡΠΊΠΈ Π·ΡΠ΅Π½ΠΈΡ Loss Function ΠΌΠΎΠΆΠ΅Ρ Π±ΡΡΡ ΠΎΠΏΡΠ΅Π΄Π΅Π»Π΅Π½Π° ΠΊΠ°ΠΊ ΡΡΠ½ΠΊΡΠΈΡ, ΠΊΠΎΡΠΎΡΠ°Ρ ΠΏΡΠΈΠ½ΠΈΠΌΠ°Π΅Ρ Π΄Π²Π° ΠΏΠ°ΡΠ°ΠΌΠ΅ΡΡΠ°:
ΠΡΠ° ΡΡΠ½ΠΊΡΠΈΡ, ΠΏΠΎ ΡΡΡΠΈ, Π²ΡΡΠΈΡΠ»ΠΈΡ, Π½Π°ΡΠΊΠΎΠ»ΡΠΊΠΎ Ρ ΠΎΡΠΎΡΠΎ ΡΠ°Π±ΠΎΡΠ°Π΅Ρ Π½Π°ΡΠ° ΠΌΠΎΠ΄Π΅Π»Ρ, ΡΡΠ°Π²Π½ΠΈΠ² ΡΠΎ, ΡΡΠΎ ΠΌΠΎΠ΄Π΅Π»Ρ ΠΏΡΠΎΠ³Π½ΠΎΠ·ΠΈΡΡΠ΅Ρ, Ρ ΡΠ°ΠΊΡΠΈΡΠ΅ΡΠΊΠΈΠΌ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ΠΌ, ΠΊΠΎΡΠΎΡΠΎΠ΅ ΠΎΠ½Π° Π΄ΠΎΠ»ΠΆΠ½Π° Π²ΡΠ΄Π°Π΅Ρ. ΠΡΠ»ΠΈ Ypred ΠΎΡΠ΅Π½Ρ Π΄Π°Π»Π΅ΠΊΠΎ ΠΎΡ Yi, Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΏΠΎΡΠ΅ΡΡ Π±ΡΠ΄Π΅Ρ ΠΎΡΠ΅Π½Ρ Π²ΡΡΠΎΠΊΠΈΠΌ. ΠΠ΄Π½Π°ΠΊΠΎ, Π΅ΡΠ»ΠΈ ΠΎΠ±Π° Π·Π½Π°ΡΠ΅Π½ΠΈΡ ΠΏΠΎΡΡΠΈ ΠΎΠ΄ΠΈΠ½Π°ΠΊΠΎΠ²Ρ, Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΏΠΎΡΠ΅ΡΡ Π±ΡΠ΄Π΅Ρ ΠΎΡΠ΅Π½Ρ Π½ΠΈΠ·ΠΊΠΈΠΌ. Π‘Π»Π΅Π΄ΠΎΠ²Π°ΡΠ΅Π»ΡΠ½ΠΎ, Π½Π°ΠΌ Π½ΡΠΆΠ½ΠΎ ΡΠΎΡ ΡΠ°Π½ΠΈΡΡ ΡΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ, ΠΊΠΎΡΠΎΡΠ°Ρ ΠΌΠΎΠΆΠ΅Ρ ΡΡΡΠ΅ΠΊΡΠΈΠ²Π½ΠΎ Π½Π°ΠΊΠ°Π·ΡΠ²Π°ΡΡ ΠΌΠΎΠ΄Π΅Π»Ρ, ΠΏΠΎΠΊΠ° ΡΠ° ΠΎΠ±ΡΡΠ°Π΅ΡΡΡ Π½Π° Π’ΡΠ΅Π½ΠΈΡΠΎΠ²ΠΎΡΠ½ΡΡ Π΄Π°Π½Π½ΡΡ (Train Data).
ΠΡΠΎΡ ΡΡΠ΅Π½Π°ΡΠΈΠΉ Π² ΡΠ΅ΠΌ-ΡΠΎ Π°Π½Π°Π»ΠΎΠ³ΠΈΡΠ΅Π½ ΠΏΠΎΠ΄Π³ΠΎΡΠΎΠ²ΠΊΠ΅ ΠΊ ΡΠΊΠ·Π°ΠΌΠ΅Π½Π°ΠΌ. ΠΡΠ»ΠΈ ΠΊΡΠΎ-ΡΠΎ ΠΏΠ»ΠΎΡ ΠΎ ΡΠ΄Π°Π΅Ρ ΡΠΊΠ·Π°ΠΌΠ΅Π½, ΠΌΡ ΠΌΠΎΠΆΠ΅ΠΌ ΡΠΊΠ°Π·Π°ΡΡ, ΡΡΠΎ ΠΏΠΎΡΠ΅ΡΡ ΠΎΡΠ΅Π½Ρ Π²ΡΡΠΎΠΊΠ°, ΠΈ ΡΡΠΎΠΌΡ ΡΠ΅Π»ΠΎΠ²Π΅ΠΊΡ ΠΏΡΠΈΠ΄Π΅ΡΡΡ ΠΌΠ½ΠΎΠ³ΠΎΠ΅ ΠΈΠ·ΠΌΠ΅Π½ΠΈΡΡ Π²Π½ΡΡΡΠΈ ΡΠ΅Π±Ρ, ΡΡΠΎΠ±Ρ Π² ΡΠ»Π΅Π΄ΡΡΡΠΈΠΉ ΡΠ°Π· ΠΏΠΎΠ»ΡΡΠΈΡΡ Π»ΡΡΡΡΡ ΠΎΡΠ΅Π½ΠΊΡ. ΠΠ΄Π½Π°ΠΊΠΎ, Π΅ΡΠ»ΠΈ ΡΠΊΠ·Π°ΠΌΠ΅Π½ ΠΏΡΠΎΠΉΠ΄Π΅Ρ Ρ ΠΎΡΠΎΡΠΎ, ΡΡΡΠ΄Π΅Π½Ρ ΠΌΠΎΠΆΠ΅Ρ Π²Π΅ΡΡΠΈ ΡΠ΅Π±Ρ ΠΏΠΎΠ΄ΠΎΠ±Π½ΡΠΌ ΠΎΠ±ΡΠ°Π·ΠΎΠΌ ΠΈ Π² ΡΠ»Π΅Π΄ΡΡΡΠΈΠΉ ΡΠ°Π·.
Π’Π΅ΠΏΠ΅ΡΡ Π΄Π°Π²Π°ΠΉΡΠ΅ ΡΠ°ΡΡΠΌΠΎΡΡΠΈΠΌ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ ΠΊΠ°ΠΊ Π·Π°Π΄Π°ΡΡ ΠΈ ΠΏΠΎΠΉΠΌΠ΅ΠΌ, ΠΊΠ°ΠΊ Π² ΡΡΠΎΠΌ ΡΠ»ΡΡΠ°Π΅ ΡΠ°Π±ΠΎΡΠ°Π΅Ρ ΡΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ.
ΠΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΎΠ½Π½ΡΠ΅ ΠΏΠΎΡΠ΅ΡΠΈ
ΠΠΎΠ³Π΄Π° Π½Π΅ΠΉΡΠΎΠ½Π½Π°Ρ ΡΠ΅ΡΡ ΠΏΡΡΠ°Π΅ΡΡΡ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ Π΄ΠΈΡΠΊΡΠ΅ΡΠ½ΠΎΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅, ΠΌΡ ΡΠ°ΡΡΠΌΠ°ΡΡΠΈΠ²Π°Π΅ΠΌ ΡΡΠΎ ΠΊΠ°ΠΊ ΠΌΠΎΠ΄Π΅Π»Ρ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ. ΠΡΠΎ ΠΌΠΎΠΆΠ΅Ρ Π±ΡΡΡ ΡΠ΅ΡΡ, ΠΏΡΡΠ°ΡΡΠ°ΡΡΡ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ, ΠΊΠ°ΠΊΠΎΠ΅ ΠΆΠΈΠ²ΠΎΡΠ½ΠΎΠ΅ ΠΏΡΠΈΡΡΡΡΡΠ²ΡΠ΅Ρ Π½Π° ΠΈΠ·ΠΎΠ±ΡΠ°ΠΆΠ΅Π½ΠΈΠΈ, ΠΈΠ»ΠΈ ΡΠ²Π»ΡΠ΅ΡΡΡ Π»ΠΈ ΡΠ»Π΅ΠΊΡΡΠΎΠ½Π½ΠΎΠ΅ ΠΏΠΈΡΡΠΌΠΎ ΡΠΏΠ°ΠΌΠΎΠΌ. Π‘Π½Π°ΡΠ°Π»Π° Π΄Π°Π²Π°ΠΉΡΠ΅ ΠΏΠΎΡΠΌΠΎΡΡΠΈΠΌ, ΠΊΠ°ΠΊ ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»Π΅Π½Ρ Π²ΡΡ ΠΎΠ΄Π½ΡΠ΅ Π΄Π°Π½Π½ΡΠ΅ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΎΠ½Π½ΠΎΠΉ Π½Π΅ΠΉΡΠΎΠ½Π½ΠΎΠΉ ΡΠ΅ΡΠΈ.
ΠΡΡ
ΠΎΠ΄Π½ΠΎΠΉ ΡΠΎΡΠΌΠ°Ρ Π΄Π°Π½Π½ΡΡ
Π½Π΅ΠΉΡΠΎΡΠ΅ΡΠΈ Π±ΠΈΠ½Π°ΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ
ΠΠΎΠ»ΠΈΡΠ΅ΡΡΠ²ΠΎ ΡΠ·Π»ΠΎΠ² Π²ΡΡ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΡΠ»ΠΎΡ Π±ΡΠ΄Π΅Ρ Π·Π°Π²ΠΈΡΠ΅ΡΡ ΠΎΡ ΠΊΠΎΠ»ΠΈΡΠ΅ΡΡΠ²Π° ΠΊΠ»Π°ΡΡΠΎΠ², ΠΏΡΠΈΡΡΡΡΡΠ²ΡΡΡΠΈΡ Π² Π΄Π°Π½Π½ΡΡ . ΠΠ°ΠΆΠ΄ΡΠΉ ΡΠ·Π΅Π» Π±ΡΠ΄Π΅Ρ ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»ΡΡΡ ΠΎΠ΄ΠΈΠ½ ΠΊΠ»Π°ΡΡ. ΠΠ½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΊΠ°ΠΆΠ΄ΠΎΠ³ΠΎ Π²ΡΡ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΡΠ·Π»Π° ΠΏΠΎ ΡΡΡΠ΅ΡΡΠ²Ρ ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»ΡΠ΅Ρ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΡ ΡΠΎΠ³ΠΎ, ΡΡΠΎ ΡΡΠΎΡ ΠΊΠ»Π°ΡΡ ΡΠ²Π»ΡΠ΅ΡΡΡ ΠΏΡΠ°Π²ΠΈΠ»ΡΠ½ΡΠΌ.
ΠΠ°ΠΊ ΡΠΎΠ»ΡΠΊΠΎ ΠΌΡ ΠΏΠΎΠ»ΡΡΠΈΠΌ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΠΈ Π²ΡΠ΅Ρ ΡΠ°Π·Π»ΠΈΡΠ½ΡΡ ΠΊΠ»Π°ΡΡΠΎΠ², ΡΠ°ΡΡΠΌΠΎΡΡΠΈΠΌ ΡΠΎΡ, ΡΡΠΎ ΠΈΠΌΠ΅Π΅Ρ Π½Π°ΠΈΠ±ΠΎΠ»ΡΡΡΡ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΡ. ΠΠΎΡΠΌΠΎΡΡΠΈΠΌ, ΠΊΠ°ΠΊ Π²ΡΠΏΠΎΠ»Π½ΡΠ΅ΡΡΡ Π΄Π²ΠΎΠΈΡΠ½Π°Ρ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ.
ΠΠΈΠ½Π°ΡΠ½Π°Ρ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ
Π Π΄Π²ΠΎΠΈΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ Π½Π° Π²ΡΡ ΠΎΠ΄Π½ΠΎΠΌ ΡΠ»ΠΎΠ΅ Π±ΡΠ΄Π΅Ρ ΡΠΎΠ»ΡΠΊΠΎ ΠΎΠ΄ΠΈΠ½ ΡΠ·Π΅Π». Π§ΡΠΎΠ±Ρ ΠΏΠΎΠ»ΡΡΠΈΡΡ ΡΠ΅Π·ΡΠ»ΡΡΠ°Ρ Π² ΡΠΎΡΠΌΠ°ΡΠ΅ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΠΈ, Π½Π°ΠΌ Π½ΡΠΆΠ½ΠΎ ΠΏΡΠΈΠΌΠ΅Π½ΠΈΡΡ Π€ΡΠ½ΠΊΡΠΈΡ Π°ΠΊΡΠΈΠ²Π°ΡΠΈΠΈ (Activation Function). ΠΠΎΡΠΊΠΎΠ»ΡΠΊΡ Π΄Π»Ρ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΠΈ ΡΡΠ΅Π±ΡΠ΅ΡΡΡ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΎΡ 0 Π΄ΠΎ 1, ΠΌΡ Π±ΡΠ΄Π΅ΠΌ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ Π‘ΠΈΠ³ΠΌΠΎΠΈΠ΄ (Sigmoid), ΠΊΠΎΡΠΎΡΠ°Ρ ΠΏΡΠΈΠ²Π΅Π΄Π΅Ρ Π»ΡΠ±ΠΎΠ΅ ΡΠ΅Π°Π»ΡΠ½ΠΎΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΊ Π΄ΠΈΠ°ΠΏΠ°Π·ΠΎΠ½Ρ Π·Π½Π°ΡΠ΅Π½ΠΈΠΉ ΠΎΡ 0 Π΄ΠΎ 1.
ΠΠΈΠ·ΡΠ°Π»ΠΈΠ·Π°ΡΠΈΡ ΠΏΡΠ΅ΠΎΠ±ΡΠ°Π·ΠΎΠ²Π°Π½ΠΈΡ Π·Π½Π°ΡΠ΅Π½ΠΈΡ ΡΠΈΠ³ΠΌΠΎΠΈΠ΄ΠΎΠΌ
ΠΠΎ ΠΌΠ΅ΡΠ΅ ΡΠΎΠ³ΠΎ, ΠΊΠ°ΠΊ Π²Ρ ΠΎΠ΄Π½ΡΠ΅ ΡΠ΅Π°Π»ΡΠ½ΡΠ΅ Π΄Π°Π½Π½ΡΠ΅ ΡΡΠ°Π½ΠΎΠ²ΡΡΡΡ Π±ΠΎΠ»ΡΡΠ΅ ΠΈ ΡΡΡΠ΅ΠΌΡΡΡΡ ΠΊ ΠΏΠ»ΡΡ Π±Π΅ΡΠΊΠΎΠ½Π΅ΡΠ½ΠΎΡΡΠΈ, Π²ΡΡ ΠΎΠ΄Π½ΡΠ΅ Π΄Π°Π½Π½ΡΠ΅ ΡΠΈΠ³ΠΌΠΎΠΈΠ΄Π° Π±ΡΠ΄ΡΡ ΡΡΡΠ΅ΠΌΠΈΡΡΡΡ ΠΊ Π΅Π΄ΠΈΠ½ΠΈΡΠ΅. Π ΠΊΠΎΠ³Π΄Π° Π½Π° Π²Ρ ΠΎΠ΄Π΅ Π·Π½Π°ΡΠ΅Π½ΠΈΡ ΡΡΠ°Π½ΠΎΠ²ΡΡΡΡ ΠΌΠ΅Π½ΡΡΠ΅ ΠΈ ΡΡΡΠ΅ΠΌΡΡΡΡ ΠΊ ΠΎΡΡΠΈΡΠ°ΡΠ΅Π»ΡΠ½ΠΎΠΉ Π±Π΅ΡΠΊΠΎΠ½Π΅ΡΠ½ΠΎΡΡΠΈ, Π½Π° Π²ΡΡ ΠΎΠ΄Π΅ ΡΠΈΡΠ»Π° Π±ΡΠ΄ΡΡ ΡΡΡΠ΅ΠΌΠΈΡΡΡΡ ΠΊ Π½ΡΠ»Ρ. Π’Π΅ΠΏΠ΅ΡΡ ΠΌΡ Π³Π°ΡΠ°Π½ΡΠΈΡΠΎΠ²Π°Π½Π½ΠΎ ΠΏΠΎΠ»ΡΡΠ°Π΅ΠΌ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΎΡ 0 Π΄ΠΎ 1, ΠΈ ΡΡΠΎ ΠΈΠΌΠ΅Π½Π½ΠΎ ΡΠΎ, ΡΡΠΎ Π½Π°ΠΌ Π½ΡΠΆΠ½ΠΎ, ΠΏΠΎΡΠΊΠΎΠ»ΡΠΊΡ Π½Π°ΠΌ Π½ΡΠΆΠ½Ρ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΠΈ.
ΠΡΠ»ΠΈ Π²ΡΡ ΠΎΠ΄ Π²ΡΡΠ΅ 0,5 (Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΡ 50%), ΠΌΡ Π±ΡΠ΄Π΅ΠΌ ΡΡΠΈΡΠ°ΡΡ, ΡΡΠΎ ΠΎΠ½ ΠΏΠΎΠΏΠ°Π΄Π°Π΅Ρ Π² ΠΏΠΎΠ»ΠΎΠΆΠΈΡΠ΅Π»ΡΠ½ΡΠΉ ΠΊΠ»Π°ΡΡ, Π° Π΅ΡΠ»ΠΈ ΠΎΠ½ Π½ΠΈΠΆΠ΅ 0,5, ΠΌΡ Π±ΡΠ΄Π΅ΠΌ ΡΡΠΈΡΠ°ΡΡ, ΡΡΠΎ ΠΎΠ½ ΠΏΠΎΠΏΠ°Π΄Π°Π΅Ρ Π² ΠΎΡΡΠΈΡΠ°ΡΠ΅Π»ΡΠ½ΡΠΉ ΠΊΠ»Π°ΡΡ. ΠΠ°ΠΏΡΠΈΠΌΠ΅Ρ, Π΅ΡΠ»ΠΈ ΠΌΡ ΠΎΠ±ΡΡΠ°Π΅ΠΌ Π½Π΅ΠΉΡΠΎΡΠ΅ΡΡ Π΄Π»Ρ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ ΠΊΠΎΡΠ΅ΠΊ ΠΈ ΡΠΎΠ±Π°ΠΊ, ΠΌΡ ΠΌΠΎΠΆΠ΅ΠΌ Π½Π°Π·Π½Π°ΡΠΈΡΡ ΡΠΎΠ±Π°ΠΊΠ°ΠΌ ΠΏΠΎΠ»ΠΎΠΆΠΈΡΠ΅Π»ΡΠ½ΡΠΉ ΠΊΠ»Π°ΡΡ, ΠΈ Π²ΡΡ ΠΎΠ΄Π½ΠΎΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ Π² Π½Π°Π±ΠΎΡΠ΅ Π΄Π°Π½Π½ΡΡ Π΄Π»Ρ ΡΠΎΠ±Π°ΠΊ Π±ΡΠ΄Π΅Ρ ΡΠ°Π²Π½ΠΎ 1, Π°Π½Π°Π»ΠΎΠ³ΠΈΡΠ½ΠΎ ΠΊΠΎΡΠΊΠ°ΠΌ Π±ΡΠ΄Π΅Ρ Π½Π°Π·Π½Π°ΡΠ΅Π½ ΠΎΡΡΠΈΡΠ°ΡΠ΅Π»ΡΠ½ΡΠΉ ΠΊΠ»Π°ΡΡ, Π° Π²ΡΡ ΠΎΠ΄Π½ΠΎΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ Π΄Π»Ρ ΠΊΠΎΡΠ΅ΠΊ Π±ΡΠ΄Π΅Ρ Π±ΡΡΡ 0.
Π€ΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ, ΠΊΠΎΡΠΎΡΡΡ ΠΌΡ ΠΈΡΠΏΠΎΠ»ΡΠ·ΡΠ΅ΠΌ Π΄Π»Ρ Π΄Π²ΠΎΠΈΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ, Π½Π°Π·ΡΠ²Π°Π΅ΡΡΡ ΠΠ²ΠΎΠΈΡΠ½ΠΎΠΉ ΠΏΠ΅ΡΠ΅ΠΊΡΠ΅ΡΡΠ½ΠΎΠΉ ΡΠ½ΡΡΠΎΠΏΠΈΠ΅ΠΉ (BCE). ΠΡΠ° ΡΡΠ½ΠΊΡΠΈΡ ΡΡΡΠ΅ΠΊΡΠΈΠ²Π½ΠΎ Π½Π°ΠΊΠ°Π·ΡΠ²Π°Π΅Ρ Π½Π΅ΠΉΡΠΎΠ½Π½ΡΡ ΡΠ΅ΡΡ Π·Π° ΠΡΠΈΠ±ΠΊΠΈ (Error) Π΄Π²ΠΎΠΈΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ. ΠΠ°Π²Π°ΠΉΡΠ΅ ΠΏΠΎΡΠΌΠΎΡΡΠΈΠΌ, ΠΊΠ°ΠΊ ΠΎΠ½Π° Π²ΡΠ³Π»ΡΠ΄ΠΈΡ.
ΠΡΠ°ΡΠΈΠΊΠΈ ΠΏΠΎΡΠ΅ΡΠΈ Π±ΠΈΠ½Π°ΡΠ½ΠΎΠΉ ΠΊΡΠΎΡΡ-ΡΠ½ΡΡΠΎΠΏΠΈΠΈ
ΠΠ°ΠΊ Π²ΠΈΠ΄ΠΈΡΠ΅, Π΅ΡΡΡ Π΄Π²Π΅ ΠΎΡΠ΄Π΅Π»ΡΠ½ΡΠ΅ ΡΡΠ½ΠΊΡΠΈΠΈ, ΠΏΠΎ ΠΎΠ΄Π½ΠΎΠΉ Π΄Π»Ρ ΠΊΠ°ΠΆΠ΄ΠΎΠ³ΠΎ Π·Π½Π°ΡΠ΅Π½ΠΈΡ Y. ΠΠΎΠ³Π΄Π° Π½Π°ΠΌ Π½ΡΠΆΠ½ΠΎ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ ΠΏΠΎΠ»ΠΎΠΆΠΈΡΠ΅Π»ΡΠ½ΡΠΉ ΠΊΠ»Π°ΡΡ (Y = 1), ΠΌΡ Π±ΡΠ΄Π΅ΠΌ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ ΡΠ»Π΅Π΄ΡΡΡΡΡ ΡΠΎΡΠΌΡΠ»Ρ:
Π ΠΊΠΎΠ³Π΄Π° Π½Π°ΠΌ Π½ΡΠΆΠ½ΠΎ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ ΠΎΡΡΠΈΡΠ°ΡΠ΅Π»ΡΠ½ΡΠΉ ΠΊΠ»Π°ΡΡ (Y = 0), ΠΌΡ Π±ΡΠ΄Π΅ΠΌ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ Π½Π΅ΠΌΠ½ΠΎΠ³ΠΎ ΡΡΠ°Π½ΡΡΠΎΡΠΌΠΈΡΠΎΠ²Π°Π½Π½ΡΠΉ Π°Π½Π°Π»ΠΎΠ³:
ΠΠ»Ρ ΠΏΠ΅ΡΠ²ΠΎΠΉ ΡΡΠ½ΠΊΡΠΈΠΈ, ΠΊΠΎΠ³Π΄Π° Ypred ΡΠ°Π²Π½ΠΎ 1, ΠΏΠΎΡΠ΅ΡΡ ΡΠ°Π²Π½Π° 0, ΡΡΠΎ ΠΈΠΌΠ΅Π΅Ρ ΡΠΌΡΡΠ», ΠΏΠΎΡΠΎΠΌΡ ΡΡΠΎ Ypred ΡΠΎΡΠ½ΠΎ ΡΠ°ΠΊΠΎΠ΅ ΠΆΠ΅, ΠΊΠ°ΠΊ Y. ΠΠΎΠ³Π΄Π° Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ Ypred ΡΡΠ°Π½ΠΎΠ²ΠΈΡΡΡ Π±Π»ΠΈΠΆΠ΅ ΠΊ 0, ΠΌΡ ΠΌΠΎΠΆΠ΅ΠΌ Π½Π°Π±Π»ΡΠ΄Π°ΡΡ, ΠΊΠ°ΠΊ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΏΠΎΡΠ΅ΡΠΈ ΡΠΈΠ»ΡΠ½ΠΎ ΡΠ²Π΅Π»ΠΈΡΠΈΠ²Π°Π΅ΡΡΡ. ΠΠΎΠ³Π΄Π° ΠΆΠ΅ Ypred ΡΡΠ°Π½ΠΎΠ²ΠΈΡΡΡ ΡΠ°Π²Π½ΡΠΌ 0, ΠΏΠΎΡΠ΅ΡΡ ΡΡΡΠ΅ΠΌΠΈΡΡΡ ΠΊ Π±Π΅ΡΠΊΠΎΠ½Π΅ΡΠ½ΠΎΡΡΠΈ. ΠΡΠΎ ΠΏΡΠΎΠΈΡΡ ΠΎΠ΄ΠΈΡ, ΠΏΠΎΡΠΎΠΌΡ ΡΡΠΎ Ρ ΡΠΎΡΠΊΠΈ Π·ΡΠ΅Π½ΠΈΡ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ, 0 ΠΈ 1 β ΠΏΠΎΠ»ΡΡΠ½ΡΠ΅ ΠΏΡΠΎΡΠΈΠ²ΠΎΠΏΠΎΠ»ΠΎΠΆΠ½ΠΎΡΡΠΈ: ΠΊΠ°ΠΆΠ΄ΡΠΉ ΠΈΠ· Π½ΠΈΡ ΠΏΡΠ΅Π΄ΡΡΠ°Π²Π»ΡΠ΅Ρ ΡΠΎΠ²Π΅ΡΡΠ΅Π½Π½ΠΎ ΡΠ°Π·Π½ΡΠ΅ ΠΊΠ»Π°ΡΡΡ. ΠΠΎΡΡΠΎΠΌΡ, ΠΊΠΎΠ³Π΄Π° Ypred ΡΠ°Π²Π½ΠΎ 0, Π° Y ΡΠ°Π²Π½ΠΎ 1, ΠΏΠΎΡΠ΅ΡΠΈ Π΄ΠΎΠ»ΠΆΠ½Ρ Π±ΡΡΡ ΠΎΡΠ΅Π½Ρ Π²ΡΡΠΎΠΊΠΈΠΌΠΈ, ΡΡΠΎΠ±Ρ ΡΠ΅ΡΡ ΠΌΠΎΠ³Π»Π° Π±ΠΎΠ»Π΅Π΅ ΡΡΡΠ΅ΠΊΡΠΈΠ²Π½ΠΎ ΡΠ°ΡΠΏΠΎΠ·Π½Π°Π²Π°ΡΡ ΡΠ²ΠΎΠΈ ΠΎΡΠΈΠ±ΠΊΠΈ.
Π‘ΡΠ°Π²Π½Π΅Π½ΠΈΠ΅ ΠΏΠΎΡΠ΅ΡΡ Π΄Π²ΠΎΠΈΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ
ΠΠΎΠ»ΠΈΠ½ΠΎΠΌΠΈΠ°Π»ΡΠ½Π°Ρ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ
ΠΠΎΠ»ΠΈΠ½ΠΎΠΌΠΈΠ°Π»ΡΠ½Π°Ρ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ (Multiclass Classification) ΠΏΠΎΠ΄Ρ ΠΎΠ΄ΠΈΡ, ΠΊΠΎΠ³Π΄Π° Π½Π°ΠΌ Π½ΡΠΆΠ½ΠΎ, ΡΡΠΎΠ±Ρ Π½Π°ΡΠ° ΠΌΠΎΠ΄Π΅Π»Ρ ΠΊΠ°ΠΆΠ΄ΡΠΉ ΡΠ°Π· ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·ΡΠ²Π°Π»Π° ΠΎΠ΄ΠΈΠ½ Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΡΠΉ ΠΊΠ»Π°ΡΡ. Π’Π΅ΠΏΠ΅ΡΡ, ΠΏΠΎΡΠΊΠΎΠ»ΡΠΊΡ ΠΌΡ Π²ΡΠ΅ Π΅ΡΠ΅ ΠΈΠΌΠ΅Π΅ΠΌ Π΄Π΅Π»ΠΎ Ρ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΡΠΌΠΈ, ΠΈΠΌΠ΅Π΅Ρ ΡΠΌΡΡΠ» ΠΏΡΠΎΡΡΠΎ ΠΏΡΠΈΠΌΠ΅Π½ΠΈΡΡ ΡΠΈΠ³ΠΌΠΎΠΈΠ΄ ΠΊΠΎ Π²ΡΠ΅ΠΌ Π²ΡΡ ΠΎΠ΄Π½ΡΠΌ ΡΠ·Π»Π°ΠΌ, ΡΡΠΎΠ±Ρ ΠΌΡ ΠΏΠΎΠ»ΡΡΠ°Π»ΠΈ Π·Π½Π°ΡΠ΅Π½ΠΈΡ ΠΎΡ 0 Π΄ΠΎ 1 Π΄Π»Ρ Π²ΡΠ΅Ρ Π²ΡΡ ΠΎΠ΄Π½ΡΡ Π·Π½Π°ΡΠ΅Π½ΠΈΠΉ, Π½ΠΎ Π·Π΄Π΅ΡΡ ΠΊΡΠΎΠ΅ΡΡΡ ΠΏΡΠΎΠ±Π»Π΅ΠΌΠ°. ΠΠΎΠ³Π΄Π° ΠΌΡ ΡΠ°ΡΡΠΌΠ°ΡΡΠΈΠ²Π°Π΅ΠΌ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΠΈ Π΄Π»Ρ Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΈΡ ΠΊΠ»Π°ΡΡΠΎΠ², Π½Π°ΠΌ Π½Π΅ΠΎΠ±Ρ ΠΎΠ΄ΠΈΠΌΠΎ ΡΠ±Π΅Π΄ΠΈΡΡΡΡ, ΡΡΠΎ ΡΡΠΌΠΌΠ° Π²ΡΠ΅Ρ ΠΈΠ½Π΄ΠΈΠ²ΠΈΠ΄ΡΠ°Π»ΡΠ½ΡΡ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΠ΅ΠΉ ΡΠ°Π²Π½Π° Π΅Π΄ΠΈΠ½ΠΈΡΠ΅, ΠΏΠΎΡΠΊΠΎΠ»ΡΠΊΡ ΠΈΠΌΠ΅Π½Π½ΠΎ ΡΠ°ΠΊ ΠΎΠΏΡΠ΅Π΄Π΅Π»ΡΠ΅ΡΡΡ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΡ. ΠΡΠΈΠΌΠ΅Π½Π΅Π½ΠΈΠ΅ ΡΠΈΠ³ΠΌΠΎΠΈΠ΄Π° Π½Π΅ Π³Π°ΡΠ°Π½ΡΠΈΡΡΠ΅Ρ, ΡΡΠΎ ΡΡΠΌΠΌΠ° Π²ΡΠ΅Π³Π΄Π° ΡΠ°Π²Π½Π° Π΅Π΄ΠΈΠ½ΠΈΡΠ΅, ΠΏΠΎΡΡΠΎΠΌΡ Π½Π°ΠΌ Π½ΡΠΆΠ½ΠΎ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ Π΄ΡΡΠ³ΡΡ ΡΡΠ½ΠΊΡΠΈΡ Π°ΠΊΡΠΈΠ²Π°ΡΠΈΠΈ.
Π Π΄Π°Π½Π½ΠΎΠΌ ΡΠ»ΡΡΠ°Π΅ ΠΌΡ ΠΈΡΠΏΠΎΠ»ΡΠ·ΡΠ΅ΠΌ ΡΡΠ½ΠΊΡΠΈΡ Π°ΠΊΡΠΈΠ²Π°ΡΠΈΠΈ Softmax. ΠΡΠ° ΡΡΠ½ΠΊΡΠΈΡ Π³Π°ΡΠ°Π½ΡΠΈΡΡΠ΅Ρ, ΡΡΠΎ Π²ΡΠ΅ Π²ΡΡ ΠΎΠ΄Π½ΡΠ΅ ΡΠ·Π»Ρ ΠΈΠΌΠ΅ΡΡ Π·Π½Π°ΡΠ΅Π½ΠΈΡ ΠΎΡ 0 Π΄ΠΎ 1, Π° ΡΡΠΌΠΌΠ° Π²ΡΠ΅Ρ Π·Π½Π°ΡΠ΅Π½ΠΈΠΉ Π²ΡΡ ΠΎΠ΄Π½ΡΡ ΡΠ·Π»ΠΎΠ² Π²ΡΠ΅Π³Π΄Π° ΡΠ°Π²Π½Π° 1. ΠΡΡΠΈΡΠ»ΡΠ΅ΡΡΡ Ρ ΠΏΠΎΠΌΠΎΡΡΡ ΡΠΎΡΠΌΡΠ»Ρ:
ΠΠ°ΠΊ Π²ΠΈΠ΄ΠΈΡΠ΅, ΠΌΡ ΠΏΡΠΎΡΡΠΎ ΠΏΠ΅ΡΠ΅Π΄Π°Π΅ΠΌ Π²ΡΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΡ Π² ΡΠΊΡΠΏΠΎΠ½Π΅Π½ΡΠΈΠ°Π»ΡΠ½ΡΡ ΡΡΠ½ΠΊΡΠΈΡ. ΠΠΎΡΠ»Π΅ ΡΡΠΎΠ³ΠΎ, ΡΡΠΎΠ±Ρ ΡΠ±Π΅Π΄ΠΈΡΡΡΡ, ΡΡΠΎ Π²ΡΠ΅ ΠΎΠ½ΠΈ Π½Π°Ρ ΠΎΠ΄ΡΡΡΡ Π² Π΄ΠΈΠ°ΠΏΠ°Π·ΠΎΠ½Π΅ ΠΎΡ 0 Π΄ΠΎ 1 ΠΈ ΡΡΠΌΠΌΠ° Π²ΡΠ΅Ρ Π²ΡΡ ΠΎΠ΄Π½ΡΡ Π·Π½Π°ΡΠ΅Π½ΠΈΠΉ ΡΠ°Π²Π½Π° 1, ΠΌΡ ΠΏΡΠΎΡΡΠΎ Π΄Π΅Π»ΠΈΠΌ ΠΊΠ°ΠΆΠ΄ΡΡ ΡΠΊΡΠΏΠΎΠ½Π΅Π½ΡΡ Π½Π° ΡΡΠΌΠΌΡ ΡΠΊΡΠΏΠΎΠ½Π΅Π½Ρ.
ΠΡΠ°ΠΊ, ΠΏΠΎΡΠ΅ΠΌΡ ΠΌΡ Π΄ΠΎΠ»ΠΆΠ½Ρ ΠΏΠ΅ΡΠ΅Π΄Π°Π²Π°ΡΡ ΠΊΠ°ΠΆΠ΄ΠΎΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΡΠ΅ΡΠ΅Π· ΡΠΊΡΠΏΠΎΠ½Π΅Π½ΡΡ ΠΏΠ΅ΡΠ΅Π΄ ΠΈΡ Π½ΠΎΡΠΌΠ°Π»ΠΈΠ·Π°ΡΠΈΠ΅ΠΉ? ΠΠΎΡΠ΅ΠΌΡ ΠΌΡ Π½Π΅ ΠΌΠΎΠΆΠ΅ΠΌ ΠΏΡΠΎΡΡΠΎ Π½ΠΎΡΠΌΠ°Π»ΠΈΠ·ΠΎΠ²Π°ΡΡ ΡΠ°ΠΌΠΈ Π·Π½Π°ΡΠ΅Π½ΠΈΡ? ΠΡΠΎ ΡΠ²ΡΠ·Π°Π½ΠΎ Ρ ΡΠ΅ΠΌ, ΡΡΠΎ ΡΠ΅Π»Ρ Softmax β ΡΠ±Π΅Π΄ΠΈΡΡΡΡ, ΡΡΠΎ ΠΎΠ΄Π½ΠΎ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅ ΠΎΡΠ΅Π½Ρ Π²ΡΡΠΎΠΊΠΎΠ΅ (Π±Π»ΠΈΠ·ΠΊΠΎ ΠΊ 1), Π° Π²ΡΠ΅ ΠΎΡΡΠ°Π»ΡΠ½ΡΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΡ ΠΎΡΠ΅Π½Ρ Π½ΠΈΠ·ΠΊΠΈΠ΅ (Π±Π»ΠΈΠ·ΠΊΠΎ ΠΊ 0). Softmax ΠΈΡΠΏΠΎΠ»ΡΠ·ΡΠ΅Ρ ΡΠΊΡΠΏΠΎΠ½Π΅Π½ΡΡ, ΡΡΠΎΠ±Ρ ΡΠ±Π΅Π΄ΠΈΡΡΡΡ, ΡΡΠΎ ΡΡΠΎ ΠΏΡΠΎΠΈΠ·ΠΎΠΉΠ΄Π΅Ρ. Π Π·Π°ΡΠ΅ΠΌ ΠΌΡ Π½ΠΎΡΠΌΠ°Π»ΠΈΠ·ΡΠ΅ΠΌ ΡΠ΅Π·ΡΠ»ΡΡΠ°Ρ, ΠΏΠΎΡΠΎΠΌΡ ΡΡΠΎ Π½Π°ΠΌ Π½ΡΠΆΠ½Ρ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΠΈ.
Π’Π΅ΠΏΠ΅ΡΡ, ΠΊΠΎΠ³Π΄Π° Π½Π°ΡΠΈ Π²ΡΡ ΠΎΠ΄Π½ΡΠ΅ Π΄Π°Π½Π½ΡΠ΅ ΠΈΠΌΠ΅ΡΡ ΠΏΡΠ°Π²ΠΈΠ»ΡΠ½ΡΠΉ ΡΠΎΡΠΌΠ°Ρ, Π΄Π°Π²Π°ΠΉΡΠ΅ ΠΏΠΎΡΠΌΠΎΡΡΠΈΠΌ, ΠΊΠ°ΠΊ ΠΌΡ Π½Π°ΡΡΡΠ°ΠΈΠ²Π°Π΅ΠΌ Π΄Π»Ρ ΡΡΠΎΠ³ΠΎ ΡΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ. Π₯ΠΎΡΠΎΡΠΎ ΡΠΎ, ΡΡΠΎ ΡΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ ΠΏΠΎ ΡΡΡΠΈ ΡΠ°ΠΊΠ°Ρ ΠΆΠ΅, ΠΊΠ°ΠΊ Ρ Π΄Π²ΠΎΠΈΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ. ΠΡ ΠΏΡΠΎΡΡΠΎ ΠΏΡΠΈΠΌΠ΅Π½ΠΈΠΌ ΠΠΎΠ³Π°ΡΠΈΡΠΌΠΈΡΠ΅ΡΠΊΡΡ ΠΏΠΎΡΠ΅ΡΡ (Log Loss) ΠΊ ΠΊΠ°ΠΆΠ΄ΠΎΠΌΡ Π²ΡΡ ΠΎΠ΄Π½ΠΎΠΌΡ ΡΠ·Π»Ρ ΠΏΠΎ ΠΎΡΠ½ΠΎΡΠ΅Π½ΠΈΡ ΠΊ Π΅Π³ΠΎ ΡΠΎΠΎΡΠ²Π΅ΡΡΡΠ²ΡΡΡΠ΅ΠΌΡ ΡΠ΅Π»Π΅Π²ΠΎΠΌΡ Π·Π½Π°ΡΠ΅Π½ΠΈΡ, Π° Π·Π°ΡΠ΅ΠΌ Π½Π°ΠΉΠ΄Π΅ΠΌ ΡΡΠΌΠΌΡ ΡΡΠΈΡ Π·Π½Π°ΡΠ΅Π½ΠΈΠΉ ΠΏΠΎ Π²ΡΠ΅ΠΌ Π²ΡΡ ΠΎΠ΄Π½ΡΠΌ ΡΠ·Π»Π°ΠΌ.
ΠΠ°ΡΠ΅Π³ΠΎΡΠΈΠ°Π»ΡΠ½Π°Ρ ΠΊΡΠΎΡΡ-ΡΠ½ΡΡΠΎΠΏΠΈΡ
ΠΡΠ° ΠΏΠΎΡΠ΅ΡΡ Π½Π°Π·ΡΠ²Π°Π΅ΡΡΡ ΠΊΠ°ΡΠ΅Π³ΠΎΡΠΈΠ°Π»ΡΠ½ΠΎΠΉ ΠΡΠΎΡΡ-ΡΠ½ΡΡΠΎΠΏΠΈΠ΅ΠΉ (Cross Entropy). Π’Π΅ΠΏΠ΅ΡΡ ΠΏΠ΅ΡΠ΅ΠΉΠ΄Π΅ΠΌ ΠΊ ΡΠ°ΡΡΠ½ΠΎΠΌΡ ΡΠ»ΡΡΠ°Ρ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ, Π½Π°Π·ΡΠ²Π°Π΅ΠΌΠΎΠΌΡ ΠΌΠ½ΠΎΠ³ΠΎΠ·Π½Π°ΡΠ½ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠ΅ΠΉ.
ΠΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ ΠΏΠΎ Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΈΠΌ ΠΌΠ΅ΡΠΊΠ°ΠΌ
ΠΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ ΠΏΠΎ Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΈΠΌ ΠΌΠ΅ΡΠΊΠ°ΠΌ (MLC) Π²ΡΠΏΠΎΠ»Π½ΡΠ΅ΡΡΡ, ΠΊΠΎΠ³Π΄Π° Π½Π°ΡΠ΅ΠΉ ΠΌΠΎΠ΄Π΅Π»ΠΈ Π½Π΅ΠΎΠ±Ρ ΠΎΠ΄ΠΈΠΌΠΎ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΎ ΠΊΠ»Π°ΡΡΠΎΠ² Π² ΠΊΠ°ΡΠ΅ΡΡΠ²Π΅ Π²ΡΡ ΠΎΠ΄Π½ΡΡ Π΄Π°Π½Π½ΡΡ . ΠΠ°ΠΏΡΠΈΠΌΠ΅Ρ, ΠΌΡ ΡΡΠ΅Π½ΠΈΡΡΠ΅ΠΌ Π½Π΅ΠΉΡΠΎΠ½Π½ΡΡ ΡΠ΅ΡΡ, ΡΡΠΎΠ±Ρ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·ΡΠ²Π°ΡΡ ΠΈΠ½Π³ΡΠ΅Π΄ΠΈΠ΅Π½ΡΡ, ΠΏΡΠΈΡΡΡΡΡΠ²ΡΡΡΠΈΠ΅ Π½Π° ΠΈΠ·ΠΎΠ±ΡΠ°ΠΆΠ΅Π½ΠΈΠΈ ΠΊΠ°ΠΊΠΎΠΉ-ΡΠΎ Π΅Π΄Ρ. ΠΠ°ΠΌ Π½ΡΠΆΠ½ΠΎ Π±ΡΠ΄Π΅Ρ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΎ ΠΈΠ½Π³ΡΠ΅Π΄ΠΈΠ΅Π½ΡΠΎΠ², ΠΏΠΎΡΡΠΎΠΌΡ Π² Y Π±ΡΠ΄Π΅Ρ Π½Π΅ΡΠΊΠΎΠ»ΡΠΊΠΎ Π΅Π΄ΠΈΠ½ΠΈΡ.
ΠΠ»Ρ ΡΡΠΎΠ³ΠΎ ΠΌΡ Π½Π΅ ΠΌΠΎΠΆΠ΅ΠΌ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ Softmax, ΠΏΠΎΡΠΎΠΌΡ ΡΡΠΎ ΠΎΠ½ Π²ΡΠ΅Π³Π΄Π° Π·Π°ΡΡΠ°Π²Π»ΡΠ΅Ρ ΡΠΎΠ»ΡΠΊΠΎ ΠΎΠ΄ΠΈΠ½ ΠΊΠ»Π°ΡΡ «ΡΡΠ°Π½ΠΎΠ²ΠΈΡΡΡΡ Π΅Π΄ΠΈΠ½ΠΈΡΠ΅ΠΉ», Π° Π΄ΡΡΠ³ΠΈΠ΅ ΠΊΠ»Π°ΡΡΡ ΠΏΡΠΈΠ²ΠΎΠ΄ΠΈΡ ΠΊ Π½ΡΠ»Ρ. ΠΠΌΠ΅ΡΡΠΎ ΡΡΠΎΠ³ΠΎ ΠΌΡ ΠΌΠΎΠΆΠ΅ΠΌ ΠΏΡΠΎΡΡΠΎ ΡΠΎΡ ΡΠ°Π½ΠΈΡΡ ΡΠΈΠ³ΠΌΠΎΠΈΠ΄ Π½Π° Π²ΡΠ΅Ρ Π·Π½Π°ΡΠ΅Π½ΠΈΡΡ Π²ΡΡ ΠΎΠ΄Π½ΡΡ ΡΠ·Π»ΠΎΠ², ΠΏΠΎΡΠΊΠΎΠ»ΡΠΊΡ ΠΏΡΡΠ°Π΅ΠΌΡΡ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ ΠΈΠ½Π΄ΠΈΠ²ΠΈΠ΄ΡΠ°Π»ΡΠ½ΡΡ Π²Π΅ΡΠΎΡΡΠ½ΠΎΡΡΡ ΠΊΠ°ΠΆΠ΄ΠΎΠ³ΠΎ ΠΊΠ»Π°ΡΡΠ°.
Π§ΡΠΎ ΠΊΠ°ΡΠ°Π΅ΡΡΡ ΠΏΠΎΡΠ΅ΡΡ, ΠΌΡ ΠΌΠΎΠΆΠ΅ΠΌ Π½Π°ΠΏΡΡΠΌΡΡ ΠΈΡΠΏΠΎΠ»ΡΠ·ΠΎΠ²Π°ΡΡ Π»ΠΎΠ³Π°ΡΠΈΡΠΌΠΈΡΠ΅ΡΠΊΠΈΠ΅ ΠΏΠΎΡΠ΅ΡΠΈ Π½Π° ΠΊΠ°ΠΆΠ΄ΠΎΠΌ ΡΠ·Π»Π΅ ΠΈ ΡΡΠΌΠΌΠΈΡΠΎΠ²Π°ΡΡ ΠΈΡ , Π°Π½Π°Π»ΠΎΠ³ΠΈΡΠ½ΠΎ ΡΠΎΠΌΡ, ΡΡΠΎ ΠΌΡ Π΄Π΅Π»Π°Π»ΠΈ Π² ΠΌΡΠ»ΡΡΠΈΠΊΠ»Π°ΡΡΠΎΠ²ΠΎΠΉ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΠΈ.
Π’Π΅ΠΏΠ΅ΡΡ, ΠΊΠΎΠ³Π΄Π° ΠΌΡ ΡΠ°ΡΡΠΌΠΎΡΡΠ΅Π»ΠΈ ΠΊΠ»Π°ΡΡΠΈΡΠΈΠΊΠ°ΡΠΈΡ, ΠΏΠ΅ΡΠ΅ΠΉΠ΄Π΅ΠΌ ΠΊ ΡΠ΅Π³ΡΠ΅ΡΡΠΈΠΈ.
ΠΠΎΡΠ΅ΡΡ ΡΠ΅Π³ΡΠ΅ΡΡΠΈΠΈ
Π Π Π΅Π³ΡΠ΅ΡΡΠΈΠΈ (Regression) Π½Π°ΡΠ° ΠΌΠΎΠ΄Π΅Π»Ρ ΠΏΡΡΠ°Π΅ΡΡΡ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ Π½Π΅ΠΏΡΠ΅ΡΡΠ²Π½ΠΎΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅, Π½Π°ΠΏΡΠΈΠΌΠ΅Ρ, ΡΠ΅Π½Ρ Π½Π° ΠΆΠΈΠ»ΡΠ΅ ΠΈΠ»ΠΈ Π²ΠΎΠ·ΡΠ°ΡΡ ΡΠ΅Π»ΠΎΠ²Π΅ΠΊΠ°. ΠΠ°ΡΠ° Π½Π΅ΠΉΡΠΎΠ½Π½Π°Ρ ΡΠ΅ΡΡ Π±ΡΠ΄Π΅Ρ ΠΈΠΌΠ΅ΡΡ ΠΎΠ΄ΠΈΠ½ Π²ΡΡ ΠΎΠ΄Π½ΠΎΠΉ ΡΠ·Π΅Π» Π΄Π»Ρ ΠΊΠ°ΠΆΠ΄ΠΎΠ³ΠΎ Π½Π΅ΠΏΡΠ΅ΡΡΠ²Π½ΠΎΠ³ΠΎ Π·Π½Π°ΡΠ΅Π½ΠΈΡ, ΠΊΠΎΡΠΎΡΠΎΠ΅ ΠΌΡ ΠΏΡΡΠ°Π΅ΠΌΡΡ ΠΏΡΠ΅Π΄ΡΠΊΠ°Π·Π°ΡΡ. ΠΠΎΡΠ΅ΡΠΈ ΡΠ΅Π³ΡΠ΅ΡΡΠΈΠΈ ΡΠ°ΡΡΡΠΈΡΡΠ²Π°ΡΡΡΡ ΠΏΡΡΠ΅ΠΌ ΠΏΡΡΠΌΠΎΠ³ΠΎ ΡΡΠ°Π²Π½Π΅Π½ΠΈΡ Π²ΡΡ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΠΈ ΠΈΡΡΠΈΠ½Π½ΠΎΠ³ΠΎ Π·Π½Π°ΡΠ΅Π½ΠΈΡ.
Π‘Π°ΠΌΠ°Ρ ΠΏΠΎΠΏΡΠ»ΡΡΠ½Π°Ρ ΡΡΠ½ΠΊΡΠΈΡ ΠΏΠΎΡΠ΅ΡΡ, ΠΊΠΎΡΠΎΡΡΡ ΠΌΡ ΠΈΡΠΏΠΎΠ»ΡΠ·ΡΠ΅ΠΌ Π΄Π»Ρ ΡΠ΅Π³ΡΠ΅ΡΡΠΈΠΎΠ½Π½ΡΡ ΠΌΠΎΠ΄Π΅Π»Π΅ΠΉ, β ΡΡΠΎ Π‘ΡΠ΅Π΄Π½Π΅ΠΊΠ²Π°Π΄ΡΠ°ΡΠΈΡΠ΅ΡΠΊΠ°Ρ ΠΎΡΠΈΠ±ΠΊΠ° (MSE). ΠΠ΄Π΅ΡΡ ΠΌΡ ΠΏΡΠΎΡΡΠΎ Π²ΡΡΠΈΡΠ»ΡΠ΅ΠΌ ΠΊΠ²Π°Π΄ΡΠ°Ρ ΡΠ°Π·Π½ΠΈΡΡ ΠΌΠ΅ΠΆΠ΄Ρ Y ΠΈ YPred ΠΈ ΡΡΡΠ΅Π΄Π½ΡΠ΅ΠΌ ΠΏΠΎΠ»ΡΡΠ΅Π½Π½ΠΎΠ΅ Π·Π½Π°ΡΠ΅Π½ΠΈΠ΅.
Machine learning fundamentals (I): Cost functions and gradient descent
**This is part one of a series on machine learning fundamentals. ML fundamentals (II): Neural Networks can be found at https://towardsdatascience.com/machine-learning-fundamentals-ii-neural-networks-f1e7b2cb3eef**
Nov 27, 2017 · 8 min read
In this post I'll use a simple linear regression model to explain two machine learning (ML) fundamentals: (1) cost functions and (2) gradient descent. Linear regression isn't the most powerful model in the ML toolkit, but due to its familiarity and interpretability, it is still in widespread use in research and industry. Simply put, linear regression is used to estimate linear relationships between continuous and/or categorical data and a continuous output variable. You can see an example of this in a previous post of mine: https://conorsdatablog.wordpress.com/2017/09/02/a-quick-and-tidy-data-analysis/.
As I go through this post, I'll use X and y to refer to variables. If you prefer something more concrete (as I often do), you can imagine that y is sales, X is advertising spend and we want to estimate how advertising spend impacts sales. Visually, I'll show how a linear regression learns the best line to fit through this data:
What does the machine learn?
One question that people often have when getting started in ML is:
"What does the machine (i.e. the statistical model) actually learn?"
This will vary from model to model, but in simple terms the model learns a function f such that f(X) maps to y. Put differently, the model learns how to take X (i.e. features, or, more traditionally, independent variable(s)) in order to predict y (the target, response or, more traditionally, the dependent variable).
In the case of the simple linear regression (y = b0 + b1 * X, where X is one column/variable) the model "learns" (read: estimates) two parameters:
The bias is the level of y when X is 0 (i.e. the value of sales when advertising spend is 0) and the slope is the rate of predicted increase or decrease in y for each unit increase in X (i.e. how much do sales increase per pound spent on advertising). Both parameters are scalars (single values).
Once the model learns these parameters they can be used to compute estimated values of y given new values of X. In other words, you can use these learned parameters to predict values of y when you don't know what y is. Hey presto, a predictive model!
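In code, predicting with the learned parameters is a one-liner. Here is a Python sketch (the post's own code is in R, and the values of b0 and b1 below are hypothetical learned values):

```python
def predict(x, b0, b1):
    # predicted sales for a given advertising spend, using learned parameters
    return b0 + b1 * x

b0, b1 = 4.0, 3.5  # hypothetical learned bias and slope
print(predict(10, b0, b1))  # 39.0
```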
Learning parameters: Cost functions
There are several ways to learn the parameters of a linear regression model; I will focus on the approach that best illustrates statistical learning: minimising a cost function.
Remember that in ML, the focus is on learning from data. This is perhaps better illustrated using a simple analogy. As children we typically learn what is "right" or "good" behaviour by being told NOT to do things or being punished for having done something we shouldn't. For example, you can imagine a four-year-old sitting by a fire to keep warm. Not knowing the danger of fire, she puts her finger into it and gets burned. The next time she sits by the fire, she doesn't get burned, but she sits too close, gets too hot and has to move away. The third time she sits by the fire she finds the distance that keeps her warm without exposing her to any danger. In other words, through experience and feedback (getting burned, then getting too hot) the kid learns the optimal distance to sit from the fire. The heat from the fire in this example acts as a cost function: it helps the learner to correct and change behaviour to minimize mistakes.
In ML, cost functions are used to estimate how badly models are performing. Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y. This is typically expressed as a difference or distance between the predicted value and the actual value. The cost function (you may also see this referred to as loss or error) can be estimated by iteratively running the model to compare estimated predictions against "ground truth": the known values of y.
The objective of a ML model, therefore, is to find parameters, weights or a structure that minimises the cost function.
Minimizing the cost function: Gradient descent
Now that we know that models learn by minimizing a cost function, you may naturally wonder how the cost function is minimized: enter gradient descent. Gradient descent is an efficient optimization algorithm that attempts to find a local or global minimum of a function.
Gradient descent enables a model to learn the gradient or direction that the model should take in order to reduce errors (differences between actual y and predicted y). Direction in the simple linear regression example refers to how the model parameters b0 and b1 should be tweaked or corrected to further reduce the cost function. As the model iterates, it gradually converges towards a minimum where further tweaks to the parameters produce little or zero change in the loss; this is also referred to as convergence.
Observing learning in a linear regression model
To observe learning in a linear regression, I will set the parameters b0 and b1 and will use a model to learn these parameters from the data. In other words, we know the ground truth of the relationship between X and y and can observe the model learning this relationship through iterative correction of the parameters in response to a cost (note: the code below is written in R).
Here I define the bias and slope (equal to 4 and 3.5 respectively). I also add a column of ones to X (for the purposes of enabling matrix multiplication). I also add some Gaussian noise to y to mask the true parameters, i.e. create errors that are purely random. Now we have a dataframe with two variables, X and y, that appear to have a positive linear trend (as X increases, values of y increase).
Next I define the learning rate. This controls the size of the steps taken by each gradient. If this is too big, the model might miss the local minimum of the function. If it is too small, the model will take a long time to converge (copy the code and try this out for yourself!). Theta stores the parameters b0 and b1, which are initialized with random values (I have set these both to 20, which is suitably far away from the true parameters). The n_iterations value controls how many times the model will iterate and update values. That is, how many times the model will make predictions, calculate the cost and gradients, and update the weights. Finally, I create some placeholders to catch the values of b0, b1 and the mean squared error (MSE) upon each iteration of the model (creating these placeholders avoids iteratively growing a vector, which is very inefficient in R).
The MSE in this case is the cost function. It is simply the mean of the squared differences between predicted y and actual y (i.e. the residuals).
Now, we run the loop. On each iteration the model will predict y given the values in theta, calculate the residuals, then apply gradient descent to estimate corrective gradients, and then update the values of theta using these gradients. This process is repeated 100 times. When the loop is finished, I create a dataframe to store the learned parameters and loss per iteration.
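The post's actual code is written in R and is not reproduced in this excerpt. The following Python sketch implements the same loop under the same setup (true bias 4, true slope 3.5, Gaussian noise, parameters initialized at 20), skipping the explicit column of ones and running more iterations for good measure:

```python
import random

random.seed(0)
n = 100
x = [2 * random.random() for _ in range(n)]                # advertising spend
y = [4 + 3.5 * xi + random.gauss(0, 0.5) for xi in x]      # true b0 = 4, b1 = 3.5, plus noise

b0, b1 = 20.0, 20.0   # start far away from the true parameters
learning_rate = 0.1

for _ in range(1000):
    # residuals: predicted y minus actual y
    residuals = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]
    # gradients of the MSE cost with respect to b0 and b1
    grad_b0 = 2 / n * sum(residuals)
    grad_b1 = 2 / n * sum(r * xi for r, xi in zip(residuals, x))
    # corrective step in the direction that reduces the cost
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(b0, b1)  # close to 4 and 3.5
```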
When the iterations have completed we can plot the lines that the model estimated.
The first thing to notice is the thick red line. This is the line estimated from the initial values of b0 and b1. You can see that this doesn't fit the data points well at all, and because of this it has the highest error (MSE). However, you can see the lines gradually moving toward the data points until a line of best fit (the thick blue line) is identified. In other words, upon each iteration the model has learned better values for b0 and b1 until it finds the values that minimize the cost function. The final values that the model learns for b0 and b1 are 3.96 and 3.51 respectively, so very close to the parameters 4 and 3.5 that we set!
Voilà! Our machine has learned!
We can also visualize the decrease in the MSE across iterations of the model. It takes a steep decline in the early iterations before converging and stabilizing.
We can now use the learned values of b0 and b1 stored in theta to predict values y for new values of X.
Summary
This post presents a very simple way of understanding machine learning. It goes without saying that there is a lot more to ML, but gaining an initial intuition for the fundamentals of what is going on "underneath the hood" can go a long way toward improving your understanding of more complex models.
Cost Function is No Rocket Science!
This article was published as a part of the Data Science Blogathon.
The two main questions that popped up in my mind while working on this article were "Why am I writing this article?" and "How is my article different from other articles?" Well, the cost function is an important concept to understand in data science, but while pursuing my post-graduation I realized that the resources available online are too general and didn't address my needs completely.
I had to refer to many articles and watch some videos on YouTube to get an intuition behind cost functions. As a result, I wanted to put together the "What," "When," "How," and "Why" of cost functions in a way that explains the topic more clearly. I hope this article acts as a one-stop shop for cost functions!
Dummies' guide to the cost function 🤷‍♀️
Loss function: Used when we refer to the error for a single training example.
Cost function: Used to refer to an average of the loss functions over an entire training dataset.
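The distinction between the two definitions above can be made concrete in a couple of lines; squared error is used here only as an example loss:

```python
def loss(y, y_pred):
    # error for a single training example
    return (y - y_pred) ** 2

def cost(ys, y_preds):
    # average of the per-example losses over the whole training set
    return sum(loss(y, p) for y, p in zip(ys, y_preds)) / len(ys)

print(loss(3, 2))            # 1: one example
print(cost([3, 5], [2, 5]))  # 0.5: the mean over the dataset
```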
But, like, *why* use a cost function?
Why on earth do we need a cost function? Consider a scenario where we wish to classify data. Suppose we have the height & weight details of some cats & dogs. Let us use these 2 features to classify them correctly. If we plot these records, we get the following scatterplot:
Fig 1: Scatter plot for height & weight of various dogs & cats
Blue dots are cats & red dots are dogs. Following are some solutions to the above classification problem.
Fig: Probable solutions to our classification problem
Essentially all three classifiers have very high accuracy, but the third solution is the best because it does not misclassify any point. The reason it classifies all the points perfectly is that the line is almost exactly in between the two groups, not closer to either one of them. This is where the concept of a cost function comes in: it helps us reach the optimal solution. The cost function is the technique of evaluating "the performance of our algorithm/model".
It takes both the predicted outputs of the model and the actual outputs and calculates how wrong the model was in its prediction. It outputs a higher number if our predictions differ a lot from the actual values. As we tune our model to improve the predictions, the cost function acts as an indicator of how the model has improved. This is essentially an optimization problem: the optimization strategies always aim at "minimizing the cost function".
Types of the cost function
There are many cost functions in machine learning and each has its use cases depending on whether it is a regression problem or classification problem.
1. Regression cost Function:
Regression models deal with predicting a continuous value, for example the salary of an employee, the price of a car, loan prediction, etc. A cost function used in a regression problem is called a "Regression Cost Function". They are calculated on a distance-based error as follows:
Error = Y - Y'
where Y is the actual output and Y' is the predicted output.
The most used Regression cost functions are below,
1.1 Mean Error (ME)
Here the error for every training example is computed and then averaged. Since the individual errors can be positive or negative, they can cancel each other out during averaging, which is why ME is rarely used in practice.
1.2 Mean Squared Error (MSE)
MSE = (sum of squared errors)/n
1.3 Mean Absolute Error (MAE)
So in this cost function, MAE is measured as the average of the sum of absolute differences between predictions and actual observations.
MAE = (sum of absolute errors)/n
It is robust to outliers thus it will give better results even when our dataset has noise or outliers.
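A quick illustration of that robustness claim, comparing MAE and MSE on data with one outlier:

```python
def mae(y_true, y_pred):
    # mean absolute error
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # mean squared error, for comparison
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [2.0, 3.0, 4.0, 100.0]  # the last point is an outlier
y_pred = [2.5, 3.0, 3.5, 5.0]
print(mae(y_true, y_pred))  # 24.0: the outlier enters only linearly
print(mse(y_true, y_pred))  # 2256.375: squaring lets the outlier dominate
```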
2. Cost functions for Classification problems
Cost functions used in classification problems are different than what we use in the regression problem. A commonly used loss function for classification is the cross-entropy loss. Let us understand cross-entropy with a small example. Consider that we have a classification problem of 3 classes as follows.
Class(Orange,Apple,Tomato)
The machine learning model will give a probability distribution of these 3 classes as output for a given input data. The class with the highest probability is considered as a winner class for prediction.
The actual probability distribution for each class is shown below.
If during the training phase the input class is Tomato, the predicted probability distribution should tend toward the actual probability distribution of Tomato. If the predicted probability distribution is not close to the actual one, the model has to adjust its weights. This is where cross-entropy becomes a tool: it calculates how far the predicted probability distribution is from the actual one. In other words, cross-entropy can be considered a way to measure the distance between two probability distributions. The following image illustrates the intuition behind cross-entropy:
Fig 3: Intuition behind cross-entropy (credit: machinelearningknowledge.ai)
This was just an intuition behind cross-entropy. It has its origin in information theory. Now with this understanding of cross-entropy, let us now see the classification cost functions.
2.1 Multi-class Classification cost Functions
This cost function is used in classification problems where there are multiple classes and each input belongs to only one class. Let us now understand how cross-entropy is calculated. Assume that the model gives the probability distribution below over "n" classes for a particular input data D.
And the actual or target probability distribution of the data D is
Then cross-entropy for that particular data D is calculated as
Cross-entropy loss(y, p) = -y^T log(p)
Cross-Entropy(y, p) = -(0*log(0.1) + 0*log(0.3) + 1*log(0.6)) = 0.51
The above formula just measures the cross-entropy for a single observation or input data. The error in classification for the complete model is given by categorical cross-entropy which is nothing but the mean of cross-entropy for all N training data.
Categorical Cross-Entropy = (Sum of Cross-Entropy for N data)/N
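The worked example above translates directly into Python (the numbers are those from the text, using the natural logarithm):

```python
import math

def cross_entropy(y, p):
    # -sum(y_i * log(p_i)); only the true class (y_i = 1) contributes
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

# true class is the third one, predicted with probability 0.6
print(round(cross_entropy([0, 0, 1], [0.1, 0.3, 0.6]), 2))  # 0.51

def categorical_cross_entropy(ys, ps):
    # mean of the per-observation cross-entropies over N data points
    return sum(cross_entropy(y, p) for y, p in zip(ys, ps)) / len(ys)
```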
2.2 Binary Cross Entropy Cost Function
Binary cross-entropy is a special case of categorical cross-entropy where there is only one output, which takes a binary value of 0 or 1 to denote the negative and positive class respectively; for example, classification between cats and dogs.
Let us assume that the actual output is denoted by a single variable y; then the cross-entropy for a particular data point D can be simplified as follows:
Cross-entropy(D) = -y*log(p) when y = 1
Cross-entropy(D) = -(1-y)*log(1-p) when y = 0
The error in binary classification for the complete model is given by binary cross-entropy which is nothing but the mean of cross-entropy for all N training data.
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N
Conclusion
I hope you found this article helpful! Let me know what you think, especially if there are suggestions for improvement. You can connect with me on LinkedIn: https://www.linkedin.com/in/saily-shah/ and here's my GitHub profile: https://github.com/sailyshah