A Non-Major's Self-Attention Code Implementation

지웅쓰 2023. 1. 4. 11:38

In this post I implement, in code, the self-attention that the Transformer architecture is built from.

References: ratsgo's blog, 나동빈's YouTube channel, WikiDocs

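The Transformer removes the RNN structure but still keeps the encoder/decoder structure, and self-attention is performed in both the encoder and the decoder.

Self-attention means that Query, Key, and Value all come from the same source (all from the encoder, or all from the decoder).

First, self-attention is performed in the encoder. There are usually several layers, and each layer performs it separately.

Second, self-attention is also performed in the decoder, before the output of the stacked encoder layers is passed in.

The third attention takes its query from the decoder and its key and value from the encoder, so it is not self-attention.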

## number of words: 3, embedding dimension: 8, heads: 2
import torch

word = 3
num_embedding = 8
num_head = 2


x = torch.rand((word, num_embedding))  # one random 8-dim vector per word
print(x)

tensor([[0.3966, 0.7255, 0.1215, 0.0704, 0.3680, 0.3582, 0.9615, 0.8108],
        [0.3549, 0.9033, 0.9023, 0.8536, 0.6446, 0.4078, 0.0238, 0.7116],
        [0.0904, 0.0051, 0.0394, 0.9383, 0.0132, 0.0551, 0.7998, 0.6727]])

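The sentence has 3 words and each word is represented as an 8-dimensional vector. In practice one usually runs several attentions rather than a single one, so that attention scores are computed from several perspectives, which is why I set the number of heads to 2.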

# projection weights for one head: (num_embedding, num_embedding // num_head) = (8, 4)
w_query = torch.rand(num_embedding, (num_embedding // num_head))
w_key = torch.rand(num_embedding, (num_embedding // num_head))
w_value = torch.rand(num_embedding, (num_embedding // num_head))


print('----W-query----')
print(w_query)
print()
print('Shape : ', w_query.shape)

print('----W-key----')
print(w_key)
print()
print('Shape : ', w_key.shape)

print('----W-value----')
print(w_value)
print()
print('Shape : ', w_value.shape)

----W-query----
tensor([[0.8031, 0.2731, 0.1293, 0.3592],
        [0.7070, 0.1568, 0.9104, 0.0027],
        [0.3775, 0.3731, 0.6413, 0.3449],
        [0.4520, 0.2934, 0.3146, 0.9926],
        [0.2310, 0.6461, 0.6517, 0.9183],
        [0.2305, 0.0493, 0.2628, 0.2444],
        [0.1998, 0.5768, 0.3460, 0.1216],
        [0.4348, 0.0108, 0.4149, 0.6752]])

Shape :  torch.Size([8, 4])
----W-key----
tensor([[0.8440, 0.7468, 0.7503, 0.0340],
        [0.6032, 0.1522, 0.5199, 0.7309],
        [0.2501, 0.4626, 0.2266, 0.2562],
        [0.9972, 0.0717, 0.3277, 0.0338],
        [0.8026, 0.4364, 0.2287, 0.8022],
        [0.8233, 0.8306, 0.7460, 0.6580],
        [0.9853, 0.0867, 0.2916, 0.4633],
        [0.2422, 0.4650, 0.6861, 0.4646]])

Shape :  torch.Size([8, 4])
----W-value----
tensor([[0.1142, 0.0705, 0.4170, 0.6105],
        [0.4299, 0.7516, 0.0975, 0.9225],
        [0.4545, 0.4371, 0.4890, 0.4384],
        [0.0010, 0.6873, 0.0655, 0.3182],
        [0.3343, 0.7860, 0.2122, 0.7984],
        [0.6955, 0.3079, 0.5558, 0.6913],
        [0.5279, 0.7236, 0.9419, 0.5269],
        [0.9223, 0.6109, 0.1507, 0.5908]])

Shape :  torch.Size([8, 4])

The query, key, and value weight matrices each have shape (embedding dimension, embedding dimension / number of heads), which is (8, 4) here.
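The (8, 4) shape is per head. As a minimal sketch of why that size is chosen (the names `w_q_heads`, `w_q_full`, and `head_dim` are mine, not from the code above), placing the per-head query weights side by side gives a single (8, 8) matrix, and projecting once with it and then splitting the columns should give the same per-head queries:

```python
import torch

torch.manual_seed(0)  # only to make the sketch reproducible

word, num_embedding, num_head = 3, 8, 2
head_dim = num_embedding // num_head  # 4

x = torch.rand(word, num_embedding)

# one (8, 4) query weight per head, as in the post
w_q_heads = [torch.rand(num_embedding, head_dim) for _ in range(num_head)]

# per-head projection: each result has shape (3, 4)
q_per_head = [x @ w for w in w_q_heads]

# equivalent single projection: (8, 8) weight, then split the columns per head
w_q_full = torch.cat(w_q_heads, dim=1)           # (8, 8)
q_full = x @ w_q_full                            # (3, 8)
q_split = torch.split(q_full, head_dim, dim=1)   # 2 tensors of shape (3, 4)

print(q_per_head[0].shape, q_full.shape)
print(all(torch.allclose(a, b, atol=1e-6) for a, b in zip(q_per_head, q_split)))
```

The same reasoning applies to the key and value weights.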

##build query, key, value
# (3, 8) @ (8, 4) -> (3, 4): one 4-dim query/key/value vector per word
query = torch.matmul(x, w_query)
key = torch.matmul(x, w_key)
value = torch.matmul(x, w_value)
print('---query---')
print(query)
print()
print('---key---')
print(key)
print()
print('---value---')
print(value)

---query---
tensor([[1.6214, 1.1068, 1.8149, 1.3461],
        [2.2072, 1.2836, 2.5461, 2.4634],
        [0.9833, 0.7953, 0.9158, 1.5544]])

---key---
tensor([[2.6069, 1.3864, 1.9134, 1.9302],
        [2.9701, 1.8340, 2.1668, 2.0594],
        [2.0317, 0.5874, 1.1258, 0.7786]])

---value---
tensor([[2.0400, 2.2653, 1.6052, 2.5141],
        [2.0078, 2.7691, 1.2263, 2.9467],
        [1.1168, 1.6894, 1.0071, 1.2432]])

Multiplying the input vector sequence by the weights gives query, key, and value. If this is happening in the encoder, all three come from the encoder, and that is what makes it self-attention.
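As a side note, the same projection is often written as a bias-free linear layer instead of a raw matrix multiply. The sketch below is only an illustration under that assumption (the name `query_layer` is hypothetical): it copies a weight like `w_query` into `torch.nn.Linear` and checks that both forms give the same query.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the sketch reproducible

x = torch.rand(3, 8)
w_query = torch.rand(8, 4)

# nn.Linear stores its weight as (out_features, in_features), so copy the transpose
query_layer = nn.Linear(8, 4, bias=False)
with torch.no_grad():
    query_layer.weight.copy_(w_query.T)

query_matmul = torch.matmul(x, w_query)  # the form used in this post
query_linear = query_layer(x)            # the layer form

print(query_matmul.shape)
print(torch.allclose(query_matmul, query_linear, atol=1e-6))
```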

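##compute the softmax
import numpy as np
from torch.nn.functional import softmax

# note: this divides by sqrt(num_embedding) = sqrt(8); the Transformer paper scales by sqrt(d_k), the per-head key dimension (4 here)
softmax_score = softmax(torch.matmul(query, key.T) / np.sqrt(num_embedding), dim=-1)
softmax_score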

tensor([[0.3312, 0.6080, 0.0608],
        [0.2970, 0.6792, 0.0238],
        [0.3612, 0.5416, 0.0972]])

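The query and key obtained above are combined with a dot product (key is transposed), and the result is divided by the square root of the dimension. The code above uses the embedding dimension (8) for this; the Transformer paper uses the key dimension of each head, which would be 4 here.

Softmax is then applied to these values. Each row therefore says how strongly one word is influenced by every word vector in the sequence, and the row sums to 1.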
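To make this step concrete, here is a small check that plugs the printed query and key tensors (rounded to four decimals) back into the same formula. It should reproduce the softmax values shown above to about three decimal places. The last lines compute the variant scaled by the per-head key dimension for comparison; `weights_dk` and `d_k` are names I introduce for that purpose.

```python
import math
import torch
from torch.nn.functional import softmax

# values copied from the printed output above (rounded to 4 decimals)
query = torch.tensor([[1.6214, 1.1068, 1.8149, 1.3461],
                      [2.2072, 1.2836, 2.5461, 2.4634],
                      [0.9833, 0.7953, 0.9158, 1.5544]])
key = torch.tensor([[2.6069, 1.3864, 1.9134, 1.9302],
                    [2.9701, 1.8340, 2.1668, 2.0594],
                    [2.0317, 0.5874, 1.1258, 0.7786]])

num_embedding = 8
scores = query @ key.T  # (3, 3) raw dot products, one row per query word

# same scaling as the post: divide by sqrt(num_embedding)
weights = softmax(scores / math.sqrt(num_embedding), dim=-1)
print(weights)              # close to the softmax values printed in the post
print(weights.sum(dim=-1))  # every row sums to 1

# variant scaled by the per-head key dimension (4 here), as in the Transformer paper
d_k = key.shape[-1]
weights_dk = softmax(scores / math.sqrt(d_k), dim=-1)
print(weights_dk)
```

A smaller divisor makes the scaled scores larger, so the `d_k` variant puts more weight on the highest-scoring word in each row.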

#compute the self-attention output

torch.matmul(softmax_score, value) ## -> there are 2 heads, so concatenating the per-head results gives the same size as the original vector sequence.

tensor([[1.9643, 2.5367, 1.3385, 2.6999],
        [1.9962, 2.5938, 1.3336, 2.7776],
        [1.9328, 2.4822, 1.3418, 2.6248]])

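Multiplying the softmax values by value gives the attention output of this head, a (3, 4) tensor.

Since there are 2 heads, concatenating this with the attention output of the other head gives a (3, 8) tensor, the same size as the original word embeddings.

What is learned in self-attention are the weights that produce query, key, and value.

These weights are updated to suit the goal of the task (for example, translation or classification).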
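The concatenation can be sketched as follows. This is a minimal illustration rather than the code used for the outputs above: the helper `single_head` is a name I introduce, each head gets its own random weights, and the scores are scaled by the per-head key dimension as in the Transformer paper (the snippet earlier in this post divides by `sqrt(num_embedding)` instead). The paper also applies an output projection after the concatenation, which is omitted here.

```python
import math
import torch
from torch.nn.functional import softmax

torch.manual_seed(0)  # only to make the sketch reproducible

word, num_embedding, num_head = 3, 8, 2
head_dim = num_embedding // num_head  # 4

x = torch.rand(word, num_embedding)

def single_head(x, w_query, w_key, w_value):
    """Scaled dot-product self-attention for one head; returns (word, head_dim)."""
    query, key, value = x @ w_query, x @ w_key, x @ w_value
    weights = softmax(query @ key.T / math.sqrt(key.shape[-1]), dim=-1)
    return weights @ value

head_outputs = []
for _ in range(num_head):
    # each head has its own randomly initialised (8, 4) weights
    w_query = torch.rand(num_embedding, head_dim)
    w_key = torch.rand(num_embedding, head_dim)
    w_value = torch.rand(num_embedding, head_dim)
    head_outputs.append(single_head(x, w_query, w_key, w_value))

# (3, 4) and (3, 4) joined along the feature axis -> (3, 8)
multi_head = torch.cat(head_outputs, dim=-1)
print(multi_head.shape)  # same shape as x
```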
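A rough sketch of what "updated" means in PyTorch terms: the three weight matrices are marked as trainable, a loss is computed from the attention output, and the gradient of that loss moves the weights. The loss below is a toy objective I made up purely for illustration, and the learning rate of 0.1 is arbitrary; a real model would use a translation or classification loss.

```python
import math
import torch
from torch.nn.functional import softmax

torch.manual_seed(0)  # only to make the sketch reproducible

x = torch.rand(3, 8)

# requires_grad=True marks the projection weights as trainable
w_query = torch.rand(8, 4, requires_grad=True)
w_key = torch.rand(8, 4, requires_grad=True)
w_value = torch.rand(8, 4, requires_grad=True)

query, key, value = x @ w_query, x @ w_key, x @ w_value
weights = softmax(query @ key.T / math.sqrt(key.shape[-1]), dim=-1)
output = weights @ value

# toy objective (an assumption for illustration): push the output towards zero
loss = output.pow(2).mean()
loss.backward()  # fills w_query.grad, w_key.grad, w_value.grad

w_before = w_value.detach().clone()
with torch.no_grad():
    for w in (w_query, w_key, w_value):
        w -= 0.1 * w.grad  # one plain gradient-descent step

print(loss.item())
print((w_value.detach() - w_before).abs().max().item())  # non-zero: the weight moved
```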

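Added 2023.3.22)

Comparing Attention vs Self-Attention

With Attention, the query comes from the Decoder and the key and value come from the Encoder, and these are used to compute the attention score map.

With Self-Attention, the Query, Key, and Value all come from inside the Decoder, or all from inside the Encoder, and the map is computed from them. In other words, it is a way of extracting features based on the correlations within the data itself.

For more detail, see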
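The difference comes down to where each input is taken from, which the following sketch tries to show. The function `attention` and the tensors `encoder_states` and `decoder_states` are hypothetical names with random values; the same function covers both cases, and only the arguments change.

```python
import math
import torch
from torch.nn.functional import softmax

torch.manual_seed(0)  # only to make the sketch reproducible

d_model, d_k = 8, 4
w_query = torch.rand(d_model, d_k)
w_key = torch.rand(d_model, d_k)
w_value = torch.rand(d_model, d_k)

def attention(query_source, key_value_source):
    """Project query from one sequence and key/value from another (or the same one)."""
    query = query_source @ w_query
    key = key_value_source @ w_key
    value = key_value_source @ w_value
    weights = softmax(query @ key.T / math.sqrt(d_k), dim=-1)
    return weights @ value, weights

encoder_states = torch.rand(3, d_model)  # hypothetical 3-word source sentence
decoder_states = torch.rand(2, d_model)  # hypothetical 2-word target prefix

# Self-Attention: query, key, and value all come from the same sequence
self_out, self_map = attention(encoder_states, encoder_states)

# Attention: query from the Decoder, key and value from the Encoder
cross_out, cross_map = attention(decoder_states, encoder_states)

print(self_map.shape)   # one row per encoder word, one column per encoder word
print(cross_map.shape)  # one row per decoder word, one column per encoder word
```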

https://www.youtube.com/watch?v=WsQLdu2JMgI

https://www.youtube.com/watch?v=bgsYOGhpxDc&t=1402s

License: Attribution, NonCommercial, NoDerivatives
