I introduced the idea of transition-based dependency parsers: an efficient, linear-time method for providing the syntactic structure of natural language text.
But they had a few downsides,
and their biggest downside was that, like most ML models of that era, they worked with indicator features.
That is, you specify some condition and then check whether the configuration satisfies it.
For example: the word on top of the stack is "good"
and its part of speech is an adjective, or the next word is a personal pronoun.
Conditions like these could be the features in a conventional transition-based
dependency parser. So what's the problem with doing that?
Well, one problem is that such features are very sparse. A second problem is that the features are incomplete ➡ depending on which words and configurations happened to occur in the training data, certain features will exist because you saw a particular word before a verb, and certain other features won't exist because you never saw that word there.
Third, computing all of these features is expensive, because there are so many of them.
Even though the actual transition system itself, as I showed last time, can run quickly and efficiently,
you still have to compute all of these features.
And what people found was that about 95% of the parsing time of one of these models went into just computing all the features of every configuration. ➡ This suggests that we could probably do better with a neural approach that learns a dense, compact feature representation.
So this time we're using exactly the same stack-and-buffer configurations and running exactly the same transition sequences.
But rather than representing the configuration of the stack and buffer inefficiently with millions of symbolic features,
[Danqi Chen, 2014]
we will instead summarize this configuration as a dense vector of roughly 1,000 dimensions,
and our neural approach will learn this dense, compact feature representation.
So, quite explicitly, what I'll briefly show now is the neural dependency parser developed by Danqi Chen in 2014.
Looking at the results, it could produce something about 2% more accurate in LAS than the existing methods.
And because it doesn't do all of that symbolic feature computation — even though you might at first think a neural dependency parser, with all its real-number math and matrix-vector multiplications, would be slow — it actually ran noticeably faster than the symbolic dependency parsers,
since it didn't have all of that feature computation. Compared with the other major approaches to dependency parsing,
such as MSTParser and TurboParser, it is more accurate than the rest and runs up to 100 times faster. How can this be? ➡ a very simple implementation (see below).
[Not on the slides]
Why it could be fast:
It uses distributed representations ➡ each word is represented as a word embedding.
Even if a word has not been seen in a particular configuration, we still know roughly how it behaves, because we have seen similar words in the right configurations.
The other things that are central to a dependency parser are parts of speech and dependency labels.
These are much smaller sets — there are about 40 dependency labels, and parts of speech are on a similar scale, sometimes fewer, sometimes more — but even within these sets, some categories are very closely related to one another.
So we adopted distributed representations for these as well.
For example, there may be separate part-of-speech tags for singular and plural nouns,
but basically they behave mostly the same; and there are adjectival modifiers and numeric modifiers (the latter just numbers like 3, 4, 5),
and again they behave mostly the same: you can have "three cows" just like "brown cows".
So everything will be represented with distributed representations.
So at that point we have exactly the same kind of configuration as before: a stack, a buffer, and the dependency arcs we have started to build.
The classification decision for the next transition is made from a few elements of this configuration.
We look at the word on top of the stack, the second word on the stack, and the first word of the buffer.
And then we actually add a few extra features, based on the partial dependency graph we have already built for the words.
Each of these elements has a word and a part of speech,
and some of them already have dependents.
For example, the leftmost dependent of S2, which has a dependency arc back to the second item on the stack.
So we take these elements of the configuration and look up an embedding for each of them.
We have word embeddings, part-of-speech embeddings, and dependency-label embeddings.
And we simply concatenate them all,
just as we did before with window-based classifiers,
and that gives us a neural representation of the configuration.
Now, there is a second reason why we can expect a win from making our predictions with a deep-learning classifier, and it has to do with the softmax classifier.
The softmax classifier is the simplest classifier, similar to what we discussed for neural models: if we have a d-dimensional vector x
and a set of C classes from which we want to assign a class y,
we can build a softmax classifier using the softmax distribution we have seen before,
where we decide the class using a weight matrix W of size C × d.
We train the values of this weight matrix W on supervised data so as to minimize the negative log-likelihood loss we have seen before.
This loss is commonly called the cross-entropy loss.
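As a small illustration (not from the slides), here is a minimal sketch of such a softmax classifier and its cross-entropy loss for a single example; the sizes and values are made up.

```python
import numpy as np

def softmax_classifier_loss(W, x, y):
    """Cross-entropy (negative log-likelihood) loss of a linear softmax
    classifier for a single example.

    W : (C, d) weight matrix, one row per class
    x : (d,)   input feature vector
    y : int    index of the correct class
    """
    scores = W @ x                                   # (C,) unnormalized class scores
    scores -= scores.max()                           # stabilize the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax distribution over classes
    return -np.log(probs[y])                         # NLL of the true class

# toy usage with made-up sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                          # 3 classes, 5-dimensional input
x = rng.normal(size=5)
print(softmax_classifier_loss(W, x, y=1))
```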
That is a simple machine-learning classifier.
But a simple softmax classifier like this shares a property with most traditional machine-learning classifiers, including Naive Bayes models, support vector machines, and logistic regression:
in the end they are not very powerful classifiers — they only give linear decision boundaries.
And that can be quite limiting.
If you have a hard problem like the one pictured at the lower left, there is no way to separate the green points from the red points just by drawing a straight line.
So you will end up with a rather imperfect classifier.
So the second big win of neural classifiers is
that they can be much more powerful, because they can provide nonlinear classification. Rather than something like the picture on the left, we can come up with a classifier that does something like the picture on the right.
Note, though, that we are still using a softmax; the reason this works is that underneath it there are several other layers of the neural network, and the softmax is used only as the final output layer.
The neural network below warps the space (bends it) and moves the representations of the data points around, so that in the end
classifying with a linear classifier at the top still gives an appropriate classification.
That is exactly what a simple feed-forward network does:
the decision is linear as far as the top softmax is concerned, but it is nonlinear in the original representation space.
Simple feed-forward NN multi-class classifier
So we start with an input representation.
These are dense representations of the input.
We pass them through a hidden layer h, using a matrix multiply followed by a nonlinearity.
The matrix multiplication transforms the space and maps things around.
We then feed the output into a softmax layer and get softmax probabilities, from which we can make a classification decision.
And to the extent that probability is not assigned to the correct class, we incur a log loss, or cross-entropy error.
This error is back-propagated to the parameters and the embeddings.
And through the learning that happens via backpropagation, we increasingly learn the parameters of the model's hidden layer, which learns how to re-represent the input —
the error is passed back through the intermediate h toward the input and on to the parameters.
If we had something like a visual signal, we would just feed in the raw real numbers here and that would be the end of it.
But with human language material, we usually put one more layer before this input vector: below this dense input layer we add embeddings that supply the information about each word or part of speech.
Another thing in my picture here is that I have introduced a different nonlinearity,
which is used in the hidden layer and determines how much each node activates.
I will come back to this later.
Neural Dependency Parser Model Architecture
OK, so our neural dependency parser model architecture is essentially the same feed-forward network, except that
it operates on the configurations of a transition-based dependency parser.
So we construct the input layer from the transition-based dependency parser configuration,
that is, we construct the input-layer embedding by looking up the various elements, as discussed earlier.
We then feed it through this hidden layer into a softmax layer,
which gives the probabilities (the blue box) from which we can choose the next action.
And it is no more complicated than that.
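To make the architecture concrete, here is a minimal PyTorch-style sketch of this kind of transition classifier; the layer sizes, the number of looked-up features, and the class name are illustrative assumptions rather than the exact Chen & Manning configuration.

```python
import torch
import torch.nn as nn

class NeuralTransitionClassifier(nn.Module):
    """Sketch of a Chen & Manning (2014)-style transition classifier.
    n_features is the number of configuration elements we look up
    (words, POS tags, arc labels); all sizes here are illustrative."""
    def __init__(self, vocab_size, embed_dim=50, n_features=48,
                 hidden_dim=200, n_transitions=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(n_features * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, n_transitions)

    def forward(self, feature_ids):
        # feature_ids: (batch, n_features) indices of words/POS/labels
        # drawn from the current stack-and-buffer configuration
        x = self.embed(feature_ids)          # (batch, n_features, embed_dim)
        x = x.view(x.size(0), -1)            # concatenate all the embeddings
        h = torch.relu(self.hidden(x))       # dense hidden representation
        return self.output(h)                # scores over the next transitions

# cross-entropy over the transition scores trains the whole model end to end
model = NeuralTransitionClassifier(vocab_size=10000)
scores = model(torch.randint(0, 10000, (2, 48)))
loss = nn.functional.cross_entropy(scores, torch.tensor([0, 2]))
```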
[Not on the slides]
But what we found was that, simply by using
in some sense the simplest kind of feed-forward neural classifier, you can get a very accurate dependency parser — one that determines the sentence structure that supports interpreting the meaning.
Indeed, despite it being a pretty simple architecture,
in 2014 this was the first successful neural dependency parser.
And in particular, the dense representations — and partly the nonlinearity of the classifier — gave us the nice result that it could outperform symbolic parsers in both accuracy and speed.
After that, many people got excited about this neural dependency parser; for more on
what happened afterwards, see the slides below. So many people were excited by the success of
the neural dependency parser, and many people, particularly at Google, then built
fancier transition-based neural dependency parsers. They explored bigger, deeper networks:
there is no reason to have only one hidden layer — you can have two; you can do the beam search I briefly mentioned last time;
and another thing, which I will not go into, is adding conditional random field-style inference
over the sequence of decisions. As a 2016 model they called it Parsey McParseFace — which is hard to say with a straight face — and it was roughly 2.5–3% more accurate than the model we produced. But it is still basically in the same family: a transition-based parser with a neural-network classifier used to choose the next transition.
The alternative to transition-based parsers is graph-based dependency parsing.
For graph-based dependency parsing, you effectively consider every pair of words:
you consider each word as a possible dependent of ROOT, and you try to compute a score —
a score for how likely it is that "big" is a dependent of ROOT,
how likely it is that it is a dependent of "cat", and so on.
To know how likely such a dependency is, you need to know more than just which two words are involved
➡ you need to understand the context.
If we can score each pairwise dependency, then we can simply choose the best one, so we might say that "big" is probably a dependent of "cat".
To a first approximation, for each word we would want to choose the head it is most likely to be a dependent of.
But, as we discussed last time, we want to extract a tree with a single root, so we want to do this subject to some constraints.
And you can do that with a minimum-spanning-tree algorithm that uses these scores of how likely the different dependencies are.
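As a rough illustration of the idea (not code from the lecture), the sketch below scores every head–dependent pair and, as a first approximation, greedily picks the best head per word; the score matrix is made up, and a real parser would run a spanning-tree algorithm such as Chu-Liu/Edmonds over the same scores to guarantee a tree.

```python
import numpy as np

def greedy_heads(score):
    """score[h, d] = how good word h looks as the head of word d
    (index 0 is ROOT). This first approximation just picks the
    best-scoring head for each word; a real graph-based parser runs
    a spanning-tree algorithm over the same scores so the result is
    guaranteed to be a well-formed tree."""
    n = score.shape[0]
    heads = np.zeros(n, dtype=int)
    for d in range(1, n):                 # skip ROOT itself
        candidates = score[:, d].copy()
        candidates[d] = -np.inf           # a word cannot head itself
        heads[d] = int(np.argmax(candidates))
    return heads

# toy scores for "ROOT the big cat" (values invented for illustration)
scores = np.array([[0, 1, 0, 5],
                   [0, 0, 0, 0],
                   [0, 0, 0, 2],
                   [0, 4, 6, 0]], dtype=float)
print(greedy_heads(scores))   # -> [0 3 3 0]: "the"->cat, "big"->cat, "cat"->ROOT
```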
So for the dependency representation of a sentence, we want an understanding of the context to the left and right of "big".
There had been earlier work on graph-based dependency parsing — like the MST parser I showed in the earlier results slide — but
it seemed attractive that we could come up with a much better representation of context
by using neural networks that look at the context.
How we do that is what I will talk about in the final part of the lecture:
working out how to use context to build better graph-based dependency parsers.
A bit more about NNs
Regularization
When we build these neural nets, we are now building models with a huge number of parameters.
So for essentially every neural net model that works well, the full loss function is in practice a regularized loss function.
J: the loss function
The first term is the part you have seen before when using a softmax classifier
➡ we take the negative log-likelihood loss and average it over the different examples.
Pink box: regularization ➡ the regularization term sums the squares of every parameter in the model.
➡ This says that we only want to make a parameter non-zero when it is really useful,
➡ i.e. parameters are pushed away from zero only to the extent that they help (to the extent that they improve the likelihood estimate).
Note in particular that this penalty is assessed only once per parameter; it is not assessed separately for each example.
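A minimal sketch of this regularized objective, assuming the notation above (average negative log-likelihood plus lambda times the sum of squared parameters):

```python
import torch

def regularized_loss(logits, targets, params, lam=1e-4):
    """Average negative log-likelihood plus an L2 penalty summed once
    over all parameters (lam is the regularization strength lambda)."""
    nll = torch.nn.functional.cross_entropy(logits, targets)  # mean over the examples
    l2 = sum((p ** 2).sum() for p in params)                  # sum of squared parameters
    return nll + lam * l2

# in practice a similar effect is usually obtained by passing
# weight_decay=lam to the PyTorch optimizer instead
```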
[Figure attached from CS231n]
And having this kind of regularization is essential for building neural net models that generalize well.

The classic problem is called overfitting. Overfitting: if you have a particular training data set and you start training your model, your training error keeps dropping, because you're moving the parameters so that the model better predicts the correct answers on the training data points. You can keep doing that, and it will keep reducing your error rate. But if you look at the partially trained classifier and ask how well it classifies independent data — other test data that the model wasn't trained on — you'll find that up to a certain point it gets better at classifying independent test examples as well. After that, commonly, you'll actually start to get worse at classifying independent test examples, even though you're continuing to get better at predicting the training examples. That was then referred to as overfitting the training examples: you're fiddling the parameters of the model so they're really good at predicting the training examples, which isn't useful for predicting the independent examples that you come to at run time.

OK. That classic view of regularization is actually outmoded and wrong for modern neural networks. The right way to think of it, for the kind of modern big neural networks that we build, is that overfitting on the training data isn't a problem; but nevertheless, you need regularization to make sure that your models generalize well to independent test data. What you would like is for your graph not to look like this example, with test error starting to head up. You'd like it in the worst case to flatline and in the best case to still be gradually dropping. It will always be higher than the training error, but it's not actually showing a failure to generalize. When we train big neural nets these days, they always overfit on the training data — they hugely overfit. In fact, in many circumstances our neural nets have so many parameters that you can continue to train them on the training data until the error on the training data is zero: they get every single example right, because they can just memorize enough about it to predict the right answer. But in general, provided the models are regularized well, those models will still also generalize well and predict well on independent data.

Part of what we want to do for that is to work out how much to regularize. This lambda parameter here is the strength of regularization: if you make that lambda number big, you get more regularization, and if you make it small, you get less. You don't want it to be too big, or else you won't fit the data well; and you don't want it to be too small, or else you won't generalize well.

OK, so this is classic L2 regularization, and it's a starting point. But our big neural nets are sufficiently complex, and have sufficiently many parameters, that essentially L2 regularization doesn't cut it. So the next thing that you should know about — a very standard, good technique for building neural nets — is called dropout. Dropout is generally introduced as a slightly funny process that you do during training to avoid feature co-adaptation. In dropout, at the time you're training your model, for each instance (or each batch) in your training, and for each neuron in the model, you drop 50% of its inputs.
You just treat them as zero, which you can do by zeroing out elements of the layers. Then at test time, you don't drop any of the model weights — you keep them all. But you then effectively have all the model weights, because you're keeping twice as many things as you used at training time. Effectively, that little recipe prevents what's called feature co-adaptation. You can't have features that are only useful in the presence of particular other features, because the model can't guarantee which features are going to be present for different examples — different features are being randomly dropped all of the time. So dropout gives you a kind of middle ground between Naive Bayes and a logistic regression model: in Naive Bayes models, all the weights are set independently; in a logistic regression model, all the weights are set in the context of all the others; here you are aware of other weights, but they can randomly disappear on you. It's also related to ensemble methods like model bagging, because you're using different subsets of the features every time. But after all of those explanations, there's actually another way of thinking about dropout, developed here at Stanford in a paper by Percy Liang and students, which argues that what dropout really gives you is a strong regularizer — not a uniform regularizer like L2, which regularizes everything with an L2 loss, but one that can learn a feature-dependent regularization. And so dropout has emerged as, in general, the best way to do regularization for neural nets.

I think you've already seen and heard this one, but just to have it on my slides once: if you want your neural networks to go fast, it's really essential that you make use of vectors, matrices, and tensors, and that you don't do things with for loops. Here's a tiny example, where I'm using timeit — a useful tool you can use too, to see how fast your neural nets run under different ways of writing them. When I'm doing these dot products here, I can either do the dot product in a for loop against each word vector, or I can do the dot product with a single word-vector matrix. If I do it in a for loop, each loop takes me almost a second; whereas if I do it with a matrix multiply, it takes an order of magnitude less time. So you should always be looking to use vectors and matrices, not for loops. And that's a speed-up of about 10 times when you're doing things on a CPU. Heading forward, we're going to be using GPUs, and they only further exaggerate the advantages of using vectors and matrices — you'll commonly get two orders of magnitude of speed-up by doing things that way.

Yeah — so for the backward pass, you are running backward passes as before on the dropped-out examples, right? For the things that were dropped out, no gradient is going through them, because they weren't present; they're not affecting things. So in a particular batch, you're only training weights for the things that aren't dropped out. But since for each successive batch you drop out different things, over a bunch of batches you're training all of the weights of the model.
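As an illustration, here is a minimal sketch of (inverted) dropout, the variant most libraries implement: the surviving activations are rescaled at training time so that nothing needs to change at test time, which achieves the same effect as the keep-everything-at-test-time recipe described above.

```python
import torch

def dropout(x, p=0.5, training=True):
    """Inverted dropout: during training, zero each element of x with
    probability p and scale the survivors by 1/(1-p), so no extra
    rescaling is needed at test time."""
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()   # 1 = keep, 0 = drop
    return x * mask / (1.0 - p)

h = torch.randn(4, 8)
h_train = dropout(h, p=0.5, training=True)    # roughly half the entries zeroed
h_test = dropout(h, p=0.5, training=False)    # unchanged at test time
# in PyTorch, torch.nn.Dropout(p) does exactly this for you
```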
And so a feature-dependent regularizer means that different features can be regularized by different amounts to maximize performance. Back in this model, every feature was just being penalized by lambda times its squared value — uniform regularization — whereas the end result of dropout-style training is that some features end up being regularized much more strongly and others less strongly, and how much they're regularized depends on how much they're used: features that are used less are regularized more. But I'm not going to go into the details of how you can understand that perspective; that's outside of what I'm going to get through right now.

So the final bit: I wanted to give a little perspective on non-linearities in our neural nets. The first thing to remember is that you have to have non-linearities. If you're building a multi-layer neural net and you've just got W1 x plus b1, then you put it through W2 x plus b2, and then through W3 x — well, I guess they're different hidden layers, so I shouldn't say x; that should be hidden 1, hidden 2, hidden 3: W3 h3 plus b3 — then those multiple linear transformations compose, so they can just be collapsed down into a single linear transformation. You don't get any power as a data representation by having multiple linear layers. (There's a slightly longer story there, because you actually do get some interesting learning effects, but I'm not going to talk about that now.) Standardly, we have to have some kind of non-linearity to do something interesting in a deep neural network.

A starting point: the most classic non-linearity is the logistic, often just called the sigmoid non-linearity because of its S shape, which we've seen in previous lectures. This takes any real number and maps it onto the range (0, 1). That's basically what people used in 1980s neural nets. One disadvantage of this non-linearity is that it moves everything into the positive space, because the output is always between 0 and 1. So people decided that for many purposes it was useful to have this variant sigmoid shape, the hyperbolic tangent, shown in the second picture. Now, logistic and hyperbolic tangent sound like very different things, but actually, as you may remember from a math class, hyperbolic tangent can be represented in terms of exponentials as well; and if you do a bit of math (which we might make you do on an assignment), it turns out that a hyperbolic tangent is just a rescaled and shifted version of the logistic. So it's really exactly the same curve, just squeezed a bit, and it now goes symmetrically between -1 and 1.

Well, these kinds of transcendental functions like hyperbolic tangent are slow and expensive to compute — even on our fast computers, calculating exponentials is a bit slow. So people became interested in whether we could do things with much simpler non-linearities. What if we used a so-called hard tanh? The hard tanh just flatlines at -1 up to some point, then is y = x up until 1,
and then it just flatlines again. That seems a slightly weird thing to use, because if your input is over on the left or over on the right, you're not getting any discrimination — everything gives the same output. And somewhat surprisingly (I was surprised when people started doing this), these kinds of models proved to be very successful. That then led into what's proven to be the most successful and most widely used non-linearity in a lot of recent deep learning work — the one used in the dependency parser model I showed — which is called the Rectified Linear Unit, or ReLU. A ReLU is the simplest kind of non-linearity you can imagine: if the value of x is negative, its value is 0 — effectively it's just dead, not doing anything in the computation — and if the value of x is greater than 0, then it's simply y = x; the value is passed through. At first sight this might seem really weird — how could this be useful as a non-linearity? But if you think a bit about how you can approximate things very accurately with piecewise linear functions, you might start to see how you could use this to do accurate function approximation, and that's what ReLU units have been found to do extremely successfully.

Logistic and tanh are still used in various places — you use logistic when you want a probability output, and we'll see tanh again very soon when we get to recurrent neural networks — but they're no longer the default when making deep networks. In a lot of places, the first thing you should think about trying is ReLU non-linearities. In particular, part of why they're good is that ReLU networks train very quickly, because you get very straightforward gradient backflow: provided you're on the right-hand side, you get a constant gradient backflow from the slope of 1, so they train very quickly. The somewhat surprising fact is that almost the simplest non-linearity imaginable is still enough to give a very good neural network — but it just is. People have played around with variants: leaky ReLUs, where rather than the left-hand side going completely to 0, it goes slightly negative on a much shallower slope; the parametric ReLU, where you have an extra parameter and learn the slope of the negative part; and more recently the swish non-linearity, which looks almost like a ReLU but curves down just a little bit before it starts to go up. I think it's fair to say that none of these have really proven themselves vastly superior. There are papers saying you can get better results by using one of them, and maybe you can, but it's not night and day, and the vast majority of the work you see around still just uses ReLUs in many places.
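For reference, the non-linearities mentioned above can be written in a few lines; a sketch using standard PyTorch functions:

```python
import torch

x = torch.linspace(-3, 3, 7)

sigmoid   = torch.sigmoid(x)                          # squashes to (0, 1)
tanh      = torch.tanh(x)                             # rescaled/shifted sigmoid, range (-1, 1)
hard_tanh = torch.clamp(x, min=-1.0, max=1.0)         # flat at -1, then y = x, then flat at 1
relu      = torch.relu(x)                             # 0 for x < 0, x otherwise
leaky     = torch.nn.functional.leaky_relu(x, 0.01)   # small negative slope instead of 0
```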
OK, a couple more things. Parameter initialization: in almost all cases, you must, must, must initialize the matrices of your neural nets with small random values. Neural nets just don't work if you start the matrices off at zero, because then everything is effectively symmetric: nothing can specialize in different ways, the neural net has no ability to learn, and you get a defective solution. So standardly, you use some method such as drawing random numbers uniformly between -r and r for a small value r and filling in all the parameters with that. The exception is bias weights: it's fine to set bias weights to 0, and in some sense that's better. In terms of choosing the r value, essentially, for traditional neural nets, we want to set that range so that the numbers in our neural network stay a reasonable size — they don't get too big and they don't get too small — and whether they blow up or not depends on how many connections there are in the neural network, that is, the fan-in and fan-out of the connections. A very common initialization that you'll see in PyTorch is what's called Xavier initialization, named after the person who suggested it. It works out a value based on the fan-in and fan-out of the layers, and you can just ask for it — say "initialize with this initialization" and it will. This is another area where there have been subsequent developments; around week 5 we'll start talking about layer normalization, and if you're using layer normalization, it doesn't matter in the same way how you initialize the weights.

So finally, we have to train our models. I've briefly introduced the idea of stochastic gradient descent, and the good news is that most of the time, training neural networks with stochastic gradient descent works just fine: use it and you will get good results. However, that often requires choosing a suitable learning rate, which is my final slide of tips. There's been an enormous amount of work on optimization of neural networks, and people have come up with a whole series of more sophisticated optimizers. I'm not going to get into the details of optimization in this class, but the very loose idea is that these optimizers are adaptive: they keep track of how much gradient there is for different parameters and, based on that, decide how much to adjust each weight when doing the gradient update, rather than adjusting by a constant amount. In that family of methods there are Adagrad, RMSprop, Adam, and then variants of Adam including SparseAdam, AdamW, et cetera. The one called Adam is a pretty good place to start, and a lot of the time that's a good one to use. And again, from the perspective of PyTorch, when you're initializing an optimizer you can just say "please use Adam" and you don't actually need to know much more about it than that.
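Here is roughly how that initialization and optimizer advice looks in PyTorch; the layer sizes and learning rates are placeholder values.

```python
import torch
import torch.nn as nn

layer = nn.Linear(100, 200)

# Xavier (Glorot) initialization: the range is chosen from the fan-in and
# fan-out of the layer so that activations stay a reasonable size
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)          # biases can simply start at zero

# plain SGD needs a hand-picked learning rate ...
sgd = torch.optim.SGD(layer.parameters(), lr=1e-3)
# ... while adaptive optimizers such as Adam adjust per-parameter step
# sizes as they go and are a good default starting point
adam = torch.optim.Adam(layer.parameters(), lr=1e-3)
```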
If you are using simple stochastic gradient descent, you have to choose the learning rate — the eta value that you multiply the gradient by to decide how much to adjust the weights. I talked briefly about how you don't want it to be too big, or your model can diverge or bounce around, and you don't want it to be too small, or training will happen exceedingly slowly and you'll miss the assignment deadline. How big it should be depends on all sorts of details of the model, so you want to try out some different orders of magnitude to see which numbers let it train stably but reasonably quickly; something around 10^-3 or 10^-4 isn't a crazy place to start. In principle you can do fine just using a constant learning rate in SGD; in practice people generally find they get better results by decreasing the learning rate as they train. A very common recipe is to halve the learning rate after every k epochs, where an epoch means one pass through the entire set of training data — so perhaps every three epochs you halve the learning rate. And a final little note (there in purple): when you make a pass through the data, you don't want to go through the data items in the same order each time, because that imposes a patterning on the training examples, and the model will fall into the periodicity of those patterns. So it's best to shuffle the data before each pass through it. There are more sophisticated ways to set learning rates, and I won't really get into those now. Fancier optimizers like Adam also have a learning rate, so you still have to choose a learning-rate value; but it's effectively an initial learning rate, which the optimizer typically shrinks as it runs, so you commonly want the number it starts with to be on the larger side, because it'll be shrinking as it goes.
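A sketch of that recipe — halve the learning rate every k epochs and reshuffle the data each pass — using a toy model and dataset (all sizes and the choice k = 3 are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset and model just to show the shape of the training loop
data = TensorDataset(torch.randn(1000, 10), torch.randint(0, 3, (1000,)))
model = torch.nn.Linear(10, 3)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

# halve the learning rate every 3 epochs
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=3, gamma=0.5)

for epoch in range(9):
    # shuffle=True reshuffles the data before every pass
    for x, y in DataLoader(data, batch_size=32, shuffle=True):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    scheduler.step()     # decay the learning rate once per epoch
```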
OK. So that's all by way of introduction, and I'm now ready to start on language models and RNNs.

So what is language modeling? As two words of English, "language modeling" could mean just about anything, but in the natural language processing literature it has a very precise technical definition, which you should know. Language modeling is the task of predicting the word that comes next. If you have some context like "the students opened their", you want to be able to predict what word will come next: is it their books, their laptops, their exams, their minds? In particular, what you want is to be able to give a probability that different words will occur in this context. So a language model is a probability distribution over next words given a preceding context, and a system that does that is called a language model.

As a result, you can also think of a language model as a system that assigns a probability score to a piece of text. If we have a piece of text, we can work out its probability according to a language model: the probability of a sequence of tokens decomposes, via the chain rule, into the probability of the first token times the probability of the second given the first, et cetera, and we can work that out using what our language model provides — a product of the probabilities of predicting each next word.

Language models are really the cornerstone of human language technology. Everything you do with computers that involves human language uses language models. When you're using your phone and it's suggesting, whether well or badly, the next word that you probably want to type, that's a language model working to predict the likely next words. When the same thing happens in a Google Doc and it suggests a next word or a next few words, that's a language model. The main reason the one in Google Docs works much better than the one on your phone is that the keyboard models on a phone have to be very compact so that they can run quickly without much memory, so they're only mediocre language models, whereas something like Google Docs can do a much better job of language modeling. Query completion: same thing — there's a language model.

So the question is, how do we build language models? I briefly want to give the traditional answer first, since you should have at least some understanding of how NLP was done without neural networks. The traditional answer, which powered speech recognition and other applications for at least two — really three — decades, was what are called n-gram language models. These were a very simple, but still quite effective, idea. We want to give probabilities of next words, so we work with what are referred to as n-grams. An n-gram is just a chunk of n consecutive words, usually referred to as unigrams, bigrams, trigrams, and then 4-grams and 5-grams — a horrible set of names that would offend any humanist, but that's what people normally say. Effectively, what we do is collect statistics about how often different n-grams occur in a large amount of text and then use those to build a probability model.

The first thing we do is make what's referred to as a Markov assumption (these are also called Markov models): we decide that the word in position t+1 depends only on the preceding n-1 words. So to predict the word at t+1 given the entire preceding text, we actually throw away the early words and just use the preceding n-1 words as context. Once we've made that simplification, we can use the definition of conditional probability: the conditional probability is the probability of the n words divided by the probability of the preceding n-1 words — that is, the probability of an n-gram over the probability of an (n-1)-gram. And how do we get these n-gram and (n-1)-gram probabilities? We simply take a large amount of text in some language and count how often the different n-grams occur. So our crude statistical approximation starts off as the count of the n-gram over the count of the (n-1)-gram.

Here's an example. Suppose we are learning a 4-gram language model. We throw away all words apart from the last three, and they're our conditioning context. We use the counts from some large training corpus and see how often "students opened their books" occurred, how often "students opened their minds" occurred; and for each of those counts, we divide through by the count of how often "students opened their" occurred, and that gives us our probability estimates. For example, if in the corpus "students opened their" occurred 1,000 times and "students opened their books" occurred 400 times, we'd get a probability estimate of 0.4 for books; if "exams" occurred 100 times, we'd get 0.1 for exams.
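A tiny sketch of that counting-based estimation, on a toy corpus standing in for the large training corpus:

```python
from collections import Counter

def train_ngram_lm(tokens, n=4):
    """Count n-grams and (n-1)-grams so we can estimate
    P(w | preceding n-1 words) = count(n-gram) / count(context)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, contexts

def prob(ngrams, contexts, context, word):
    c = contexts.get(tuple(context), 0)
    return ngrams.get(tuple(context) + (word,), 0) / c if c else 0.0

# toy corpus standing in for "a large training corpus"
corpus = "students opened their books students opened their exams".split()
ngrams, contexts = train_ngram_lm(corpus, n=4)
print(prob(ngrams, contexts, ["students", "opened", "their"], "books"))  # 0.5 on this toy data
```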
And we already see here the disadvantage of having made the Markov assumption and gotten rid of all of the earlier context, which would have been useful for helping us predict.

One other point I'll mention, which I've confused myself on, is the counting in the name of an n-gram language model. A 4-gram language model is called a 4-gram language model because in its estimation you use 4-grams in the numerator and trigrams in the denominator — you use the size of the numerator. That terminology is different from the terminology used for Markov models: when people talk about the order of a Markov model, that refers to the amount of context you're using, so this would correspond to a third-order Markov model.

Someone asked: is this similar to a Naive Bayes model? Sort of. In Naive Bayes models you also estimate the probabilities just by counting, so they're related, and there are in some sense two differences. The first difference, or specialization, is that Naive Bayes models work out probabilities of words independently of their neighbors; so in one part, a Naive Bayes language model is a unigram language model — you're just using the counts of individual words. The other part of a Naive Bayes model is that you learn a different set of unigram counts for every class of your classifier. So effectively, a Naive Bayes model is a set of class-specific unigram language models.

OK, I gave this as a simple statistical model for estimating your probabilities with an n-gram model. You can't actually get away with just doing that, because you have sparsity problems. It will often be the case that for many words, "students opened their books" or "students opened their backpacks" just never occurred in the training data. If you think about it, if you have something like 10^5 different words and you want a sequence of four words, with 10^5 choices for each, there are about 10^20 different combinations; so unless you're seeing a truly astronomical amount of data, most four-word sequences have never been seen. Then your numerator will be 0 and your probability estimate will be 0, and that's bad. The commonest way of solving that is just to add a little delta to every count, so that everything is non-zero; that's called smoothing. But sometimes it's worse than that, because sometimes you won't even have seen "students opened their", and that's more problematic, because it means the denominator is 0 and the division is ill-defined — we can't usefully calculate any probabilities in a context we've never seen. The standard solution to that is to shorten the context, which is called backoff: we condition only on "opened their"; or, if we still haven't seen "opened their", we condition only on "their"; or we could forget all conditioning and just use a unigram model for our probabilities.

As you increase the order n of the n-gram language model, these sparsity problems become worse and worse. In the early days people normally worked with trigram models; as it became easier to collect billions of words of text, people commonly moved to 5-gram models.
But every time you go up an order of conditioning, you effectively need to collect orders of magnitude more data, because of the size of the vocabularies of human languages. There's also the problem that these models are huge: you basically have to store counts of all of these word sequences so you can work out the probabilities. That's actually had a big effect on what technology is available. In the 2000s, up until 2014 or whenever it was, there was already Google Translate using probabilistic models that included language models of the n-gram sort, but the only way they could possibly be run was in the cloud, because you needed these huge tables of probabilities. Now we have neural nets, and Google Translate can actually run on your phone; that's possible because neural net models can be massively more compact than these old n-gram language models.

But before we get onto the neural models, let's look at an example of how these work. It's trivial to train an n-gram language model, because you really just count how often word sequences occur in a corpus and you're ready to go; these models can be trained in seconds. That's really good — it's not like sitting around waiting for neural networks to train. So if I train on my laptop a small language model — about 1.7 million words, as a trigram model — I can then ask it to generate text. If I give it a couple of words, "today the", I can get it to suggest a word that might come next, and the way I do that is that the language model knows the probability distribution of things that can come next. It's a kind of crude probability distribution: over this relatively small corpus, there were things that occurred once ("Italian", "emirate"), things that occurred twice ("price"), and things that occurred four times ("company" and "bank"). It's fairly crude and rough, but I nevertheless get probability estimates. I can then say: based on this, let's take this probability distribution and sample the next word. The two most likely words to sample are "company" or "bank", but we're rolling the dice and might get any of the words that can come next; maybe I sample "price". Now I condition on "the price" and look up the probability distribution of what comes next — the most likely thing is "of" — and again I sample, and maybe this time I pick "of". Then I condition on "price of", look up the probability distribution of words following that, sample randomly from it, and maybe this time I sample a rare but possible one like "gold". I can keep going, and I'll get out something like this: "Today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected the IMF demand to rebuild depleted European stocks, Sep 30 end primary $0.76 a share." So what just a simple trigram model can produce over not very much text is actually already kind of interesting.
It's actually surprisingly grammatical, right? There are whole pieces of it — "while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an IMF demand" — that are really pretty good grammatical text. So it's sort of amazing that these simple n-gram models can model a lot of human language. On the other hand, it's not a very good piece of text: it's completely incoherent and makes no sense. To generate text that actually seems to make sense, we're going to need a considerably better language model, and that's precisely what neural language models have allowed us to build, as we'll see later.

OK, so how can we build a neural language model? First we'll do a simple one and see where we get; moving on to recurrent neural nets may still take us into next time. We have an input sequence of words, and we want a probability distribution over the next word. The simplest thing we could try: the only tool we have so far is a window-based classifier. What we've done previously — either for our named entity recognizer in lecture three, or what I just showed you for the dependency parser — is to take some context window, put it through a neural net, and predict something as a classifier. Before, we were predicting a location; but maybe instead we can reuse exactly the same technology and build a window-based classifier for language modeling. So we discard the further-away context words, just like in an n-gram language model, but we feed this fixed window into a neural net: we concatenate the word embeddings, put them through a hidden layer, and then have a softmax classifier over our vocabulary. Now, rather than predicting something like "location" or "left-arc" in the dependency parser, we have a softmax over the entire vocabulary, rather like the skip-gram negative-sampling model in the first two lectures. We see this choice as predicting the word that comes next — whether it produces laptops, minds, books, et cetera.

This is a fairly simple fixed-window neural net classifier, but it's essentially a famous early model in the use of neural nets for NLP applications: first in a 2000 conference paper and then in a somewhat later journal paper, Yoshua Bengio and colleagues introduced precisely this model as the neural probabilistic language model, and they were already able to show that it could give interesting, good results for language modeling. It wasn't a complete solution for neural language modeling, but it still had value. It didn't solve the problem of allowing bigger contexts for predicting the next word — it's limited in exactly the way an n-gram language model is — but it does have all the advantages of distributed representations. Rather than having counts of word sequences, which are very sparse and very crude, we can use distributed representations of words, which then make predictions that semantically similar words should give similar probability distributions. The idea is that if we use some other word here — maybe "the pupils opened their" — well, maybe in our training data we'd seen sentences about students, but we've never seen sentences about pupils. An n-gram language model then has no idea what probabilities to use, whereas a neural language model can say: "pupils" is kind of similar to "students", therefore I can predict similarly to what I would have predicted for "students".
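A minimal sketch of such a fixed-window neural language model (the vocabulary size, window size, and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Sketch of a Bengio-style fixed-window neural language model:
    concatenate the embeddings of the last `window` words, pass them
    through one hidden layer, then softmax over the whole vocabulary."""
    def __init__(self, vocab_size, window=4, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):              # (batch, window)
        e = self.embed(context_ids)              # (batch, window, embed_dim)
        e = e.view(e.size(0), -1)                # concatenate the window embeddings
        h = torch.tanh(self.hidden(e))
        return self.output(h)                    # scores over the next word

lm = FixedWindowLM(vocab_size=10000)
logits = lm(torch.randint(0, 10000, (1, 4)))          # e.g. "the students opened their"
next_word_probs = torch.softmax(logits, dim=-1)        # distribution over the vocabulary
```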
OK. So there's now no sparsity problem, and we don't need to store billions of n-gram counts; we simply need to store our word vectors and our W and U matrices. But we still have the remaining problem that our fixed window is too small. We can try to make the window larger, and if we do, the W matrix gets bigger. That points out another problem with this model: not only can the window never be large enough, but W is just a trained matrix, so we're learning completely different weights for each position of context — the word at position -1, the word at -2, at -3, and at -4. There's no sharing in the model in how it treats words in different positions, even though in some sense they contribute semantic components that are at least somewhat position-independent.

If you think back to either a Naive Bayes model or what we saw with the word2vec model at the beginning: the word2vec model or a Naive Bayes model completely ignores word order, so it has one set of parameters regardless of what position things occur in. That doesn't work well for language modeling, because word order is really important in language modeling. If the last word is "the", that's a really good predictor of an adjective or noun following; whereas if the word four positions back is "the", it doesn't give you the same information. So you do want to make some use of word order, but this model is at the opposite extreme, where each position is modeled completely independently. What we'd like is a neural architecture that can process an arbitrary amount of context, with more sharing of parameters, while still being sensitive to proximity. And that's the idea of recurrent neural networks. I'll say about five minutes' worth about these today, and then next time we'll return and do more about recurrent neural networks.

For a recurrent neural network, rather than having a single hidden layer inside our classifier that we compute afresh each time, we have a hidden layer — often referred to as the hidden state — that we maintain over time and feed back into itself; that's what the word "recurrent" means: you feed the hidden layer back into itself. Based on the first word, we compute a hidden representation, like before, which can be used to predict the next word; but when we want to predict what comes after the second word, we not only feed in the second word, we also feed in the hidden layer from the previous word, to help predict the hidden layer above the second word. Formally, the way we do that is to take the hidden layer above the first word and multiply it by a matrix W.
Then that goes in together with x2 to generate the next hidden state, and we keep doing that at each time step: we repeat the pattern of creating the next hidden layer based on the next input word and the previous hidden state, updating it by multiplying it by a matrix W. On my slide here I've still only got four words of context, because that's nice for the slide, but in principle there could now be any number of words of context.

So what we do is start off with input vectors — the one-hot vectors for the word identities — and look up our word embeddings, so we have a word embedding for each word. Then we want to compute hidden states, and we need to start somewhere: h0 is the initial hidden state, and it's normally taken to be a zero vector, so it's just initialized to zeros. For working out the first hidden state, we calculate it from the first word's embedding by multiplying that embedding by a matrix W_e, and that gives us the first hidden state. As we go on, we want to apply the same formula over again; we have just two parameter matrices in the recurrent neural network: one matrix for multiplying input embeddings and one matrix for updating the hidden state of the network. So for the second word, we take its word embedding and multiply it by the W_e matrix, we take the previous time step's hidden state and multiply it by the W_h matrix, and we use the two of those to generate the new hidden state. Precisely how we generate the new hidden state is shown in the equation on the left: we take the previous hidden state and multiply it by W_h, we take the input embedding and multiply it by W_e, we sum those two, we add on a learned bias weight, and then we put the result through a non-linearity. Although on this slide that non-linearity is written as sigma, by far the most common non-linearity to use here is tanh. This is the core equation for a simple recurrent neural network, and for each successive time step we just keep applying it to work out the hidden states.

From those hidden states, we can predict the next word just as in our window classifier: at any position, we take the hidden vector, put it through a softmax layer — multiply it by a U matrix, add another bias, and take a softmax distribution of that — and that gives the probability distribution over next words. What we've seen here is the entire math of a simple recurrent neural network; next time I'll come back and say more, but this is, in some sense, the entirety of what you need to know for the forward computation of a simple recurrent neural network.
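A minimal sketch of that forward computation, with made-up dimensions and randomly initialized matrices standing in for learned parameters:

```python
import torch

def rnn_step(h_prev, e_t, W_h, W_e, b):
    """One step of a simple RNN:
    h_t = tanh(W_h @ h_prev + W_e @ e_t + b).
    The same W_h, W_e, and b are reused at every time step."""
    return torch.tanh(W_h @ h_prev + W_e @ e_t + b)

hidden_dim, embed_dim, vocab_size = 64, 50, 10000
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1
W_e = torch.randn(hidden_dim, embed_dim) * 0.1
b = torch.zeros(hidden_dim)

h = torch.zeros(hidden_dim)                  # h0 initialized to zeros
for e_t in torch.randn(4, embed_dim):        # embeddings of 4 input words
    h = rnn_step(h, e_t, W_h, W_e, b)

# next-word distribution from the current hidden state:
U = torch.randn(vocab_size, hidden_dim) * 0.1    # output matrix over the vocabulary
b2 = torch.zeros(vocab_size)
probs = torch.softmax(U @ h + b2, dim=-1)
```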
The advantages we now have: it can process a text input of any length; in theory at least, it can use information from any number of steps back (we'll talk more about how well that actually works in practice); and the model size is fixed — it doesn't matter how much past context there is, all we have are the W_h and W_e parameters. And at each time step, we use exactly the same weights to update our hidden state, so there's symmetry in how different inputs are processed in producing our predictions.

Simple RNNs in practice, though, aren't perfect. One disadvantage is that they're actually kind of slow: with this recurrent computation, we're in some sense stuck with an outer for loop. We can do vector and matrix multiplies on the inside, but we really have to do "for time step 1 to n, calculate the successive hidden states." So that's not a perfect neural net architecture, and we'll discuss alternatives later. And although in theory this model can access information any number of steps back, in practice it's pretty imperfect at doing that; that will lead to more advanced forms of recurrent neural network, which I'll talk about next time, that can more effectively access past context.

OK, I think I'll stop there for the day.