👀, 🤷♀️ , 📜
Click these icons to see the code and additional concept explanations :)
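INTRO
In this post, we will use BERT through NVIDIA's NeMo.
[Goal]
Use large, high-performance pretrained Transformer-based language models to carry out NLP tasks:
- 0) Exploring the data: we use a dataset derived from the NCBI Disease Corpus
- 1) pretraining: text classification
- 2) fine tuning: building NER (named entity recognition)
[Prerequisite knowledge]
- attention
- transformer: read this one very carefully; once you have, this post will be a breeze.
- bert
[Original papers]
Attention Is All You Need
BERT
[References]
NeMo GitHub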
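0) Exploring the data
[Corpus Annotated Data]
Just 793 PubMed abstracts, annotated by 14 annotators
The annotations take the form of HTML-style tags inserted into the abstract text according to clearly defined rules
To apply this to other data, simply load that data instead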
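1) pretraining: text classification dataset
The goal is to classify a text according to its content.
Examples of text classification
- Sentiment analysis: 2 classes
- Topic labeling: multiple classes
To understand what dataset you need, first decide what question you want to ask.
📜 Sentiment analysis
Binary classification (2 classes)
- Is the sentiment positive?
- Is the sentiment negative?
That is, each sentence must be labeled as one of these two.
➡ Widely used by companies to gauge customer sentiment toward products, brands, or services in online conversations and feedback
📜 Topic labeling
Multi-class classification (Multi-Class Analysis)
- e.g.) Is a given medical disease abstract about cancer, about a neurological disorder, or about some other disease?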
Data format
The data is stored in .tsv format: similar to the comma-separated .csv format, but the columns are separated by tabs instead of commas.
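To make the tab-separated layout concrete, here is a minimal sketch that parses a tiny in-memory sample using only Python's standard `csv` module. The two data rows are single sentences borrowed from the abstracts shown in this post and shortened for illustration (the real files hold one full abstract per row), and the variable names are assumptions made for this example.

```python
import csv
import io

# Illustrative sample in the same layout as train.tsv: sentence<TAB>label.
# Real rows hold a full abstract; these are single sentences for brevity.
sample_tsv = (
    "sentence\tlabel\n"
    "Complex formation induces the rapid degradation of betacatenin .\t0\n"
    "Genotyping for APOE was performed blind to clinical information .\t1\n"
)

# delimiter="\t" is the only change needed compared with reading a .csv file
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t")
header = next(reader)  # the first row holds the column names
rows = [(sentence, int(label)) for sentence, label in reader]

print(header)
for sentence, label in rows:
    print(label, sentence)
```

Because the sentences themselves contain commas, a tab delimiter avoids the quoting that a comma-separated file would need.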
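[Loading the data] First, let's load the data with pandas
👀 View data-check code
Load whichever data you want to use
import os
import pandas as pd
# Show full cell contents instead of truncating the long abstracts
pd.options.display.max_colwidth = None
# TC_DATA_DIR is assumed to be defined beforehand as the directory that contains train.tsv
train_df = pd.read_csv(os.path.join(TC_DATA_DIR, 'train.tsv'), sep='\t')
train_df.head()
Result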
Identification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor . The adenomatous polyposis coli ( APC ) tumour-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In colon carcinoma cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC homologues . Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin . Like APC , APC2 regulates the formation of active betacatenin-Tcf complexes , as demonstrated using transient transcriptional activation assays in APC - / - colon carcinoma cells . Human APC2 maps to chromosome 19p13 . 3 . APC and APC2 may therefore have comparable functions in development and cancer . 0
...
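2) fine tuning: NER dataset
For the NER task, we ask a new question.
➡ Which diseases are mentioned in a given sentence of a medical abstract?
➡ In this case, the input is a sentence from an abstract and the labels are the positions of the named disease entities.
Let's look at the information provided with the dataset.
👀 View data-check code
Load whichever data you want to use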
NER_DATA_DIR = '/dli/task/data/NCBI_ner-3/'
!ls -lh $NER_DATA_DIR
Result
total 4.0M
-rw-r--r-- 1 702112 10513 181K Jul 13 2020 dev.tsv
-rw-r--r-- 1 702112 10513 5 Jul 13 2020 label_ids.csv
-rw-r--r-- 1 702112 10513 52 Jul 13 2020 label_stats.tsv
-rw-r--r-- 1 702112 10513 48K Jul 13 2020 labels_dev.txt
-rw-r--r-- 1 702112 10513 49K Jul 13 2020 labels_test.txt
-rw-r--r-- 1 702112 10513 271K Jul 13 2020 labels_train.txt
-rw-r--r-- 1 702112 10513 185K Jul 13 2020 test.tsv
-rw-r--r-- 1 702112 10513 135K Jul 13 2020 text_dev.txt
-rw-r--r-- 1 702112 10513 138K Jul 13 2020 text_test.txt
-rw-r--r-- 1 702112 10513 758K Jul 13 2020 text_train.txt
-rw-r--r-- 1 702112 10513 1023K Jul 13 2020 train.tsv
-rw-r--r-- 1 702112 10513 1.2M Jul 13 2020 train_dev.tsv
The NER task requires two files:
➡ a file of text sentences and a file of labels. Run the next two cells to see a sample of each.
!head $NER_DATA_DIR/text_train.txt
Result
Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor .
The adenomatous polyposis coli ( APC ) tumour - suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK - 3beta ) , axin / conductin and betacatenin .
Complex formation induces the rapid degradation of betacatenin .
In colon carcinoma cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf - 4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) .
Here , we report the identification and genomic structure of APC homologues .
Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin .
Like APC , APC2 regulates the formation of active betacatenin - Tcf complexes , as demonstrated using transient transcriptional activation assays in APC - / - colon carcinoma cells .
Human APC2 maps to chromosome 19p13 .
3 .
APC and APC2 may therefore have comparable functions in development and cancer .
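[IOB tagging]
You can see that the abstracts have been split into sentences. Each sentence is then further parsed into words, each carrying a label that corresponds to the corpus's original HTML-style tags.
- IOB: Inside, Outside, Beginning
It's fine to move on if this doesn't make sense yet.
NER training data identifies entities by mapping each word to a tag such as I, O, or B (inside, outside, beginning).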
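The word-to-tag mapping can be sketched in a few lines of Python. This example takes the first sentence of text_train.txt and its corresponding line in labels_train.txt, pairs each token with its tag, and collects every B...I run into one entity string. The helper name `extract_entities` is hypothetical and written only for this illustration; it is not part of NeMo.

```python
def extract_entities(tokens, tags):
    """Group tokens into entity strings using B/I/O tags."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new entity starts here
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the currently open entity
            current.append(token)
        else:                          # "O" (or a stray "I") closes any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:                        # entity that runs to the end of the sentence
        entities.append(" ".join(current))
    return entities


tokens = ("Identification of APC2 , a homologue of the adenomatous "
          "polyposis coli tumour suppressor .").split()
tags = "O O O O O O O O B I I I O O".split()

assert len(tokens) == len(tags)  # one tag per whitespace-separated token
entities = extract_entities(tokens, tags)
print(entities)  # ['adenomatous polyposis coli tumour']
```

The recovered span is the same phrase that the original corpus marks with its HTML-style category tag.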
👀 View data-check code
!head $NER_DATA_DIR/labels_train.txt
Result
O O O O O O O O B I I I O O
O B I I I I I I O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
O O O O O O O O O
O B I O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
O O O O O O O O O O O O O
O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O O O O O O O O O O O O B I O O
O O O O O O O
O O
O O O O O O O O O O O B O
That is, the tags are assigned like this:
Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor .
O O O O O O O O B I I I O O
So, recalling the original corpus tags:
Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour</category> suppressor .
``
</div>
</details>
----
------
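# **1) pretraining: building text classification**
**[Goal]**
Build an application that classifies each medical disease abstract into one of three categories: cancer, neurological diseases and disorders, or other.
## NeMo overview
NeMo is an open-source toolkit for building conversational AI applications.
* NeMo is built around Neural Modules: conceptual blocks of neural networks that take **typed inputs** and produce **typed outputs**.
* Modules: layers, encoders, decoders, language models, loss functions, or methods of combining activations
NeMo makes these building blocks easy to **combine** and **reuse**, while providing **a level of semantic correctness checking** through its Neural Type system.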
**[NeMo deep learning frameworks]**
[What is a framework?](https://yerimoh.github.io/DL108/#%ED%94%84%EB%A0%88%EC%9E%84%EC%9B%8C%ED%81%AC)
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning): a PyTorch wrapper that organizes PyTorch code for neural network training
* Provides easy, high-performance multi-GPU/multi-node mixed precision training options.
* [LightningModule](https://github.com/PyTorchLightning/pytorch-lightning)
* Used to organize PyTorch code into the computations, optimizers, and loops for training, validation, and testing
* Makes deep learning experiments easier to understand and reproduce
* [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html)
* Takes a LightningModule and can automate everything needed for deep learning training
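**[NeMo models]**
A NeMo model is a LightningModule equipped with all the supporting infrastructure for training and reproducibility.
(Features)
* Deep learning model architecture
* Data preprocessing
* Optimizer
* Checkpoints
* Experiment logging
Like LightningModules, NeMo models are PyTorch modules and are fully compatible with the wider PyTorch ecosystem.
Any NeMo model can be plugged into any PyTorch workflow.
In this post, we use the local copy of the repo included in the lab environment, which is based on the NGC NeMo container, and we focus on the NLP models.
Run the following cell to view the NeMo model tree under the examples/nlp directory.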
<details>
<summary>👀 View code</summary>
<div markdown="1">
If you can't run this check, it's fine to just move on
```python
!tree nemo/examples/nlp -L 2
```
Result
nemo/examples/nlp
├── dialogue_state_tracking
│ ├── conf
│ └── sgd_qa.py
├── entity_linking
│ ├── build_index.py
│ ├── conf
│ ├── data
│ ├── query_index.py
│ └── self_alignment_pretraining.py
├── glue_benchmark
│ ├── glue_benchmark.py
│ └── glue_benchmark_config.yaml
├── information_retrieval
│ ├── bert_dpr.py
│ ├── bert_joint_ir.py
│ ├── conf
│ ├── construct_random_negatives.py
│ └── get_msmarco.sh
├── intent_slot_classification
│ ├── conf
│ ├── data
│ └── intent_slot_classification.py
├── language_modeling
│ ├── bert_pretraining.py
│ ├── conf
│ ├── convert_weights_to_nemo1.0.py
│ ├── get_wkt2.sh
│ └── transformer_lm.py
├── machine_translation
│ ├── conf
│ ├── create_tarred_monolingual_dataset.py
│ ├── create_tarred_parallel_dataset.py
│ ├── enc_dec_nmt.py
│ ├── nmt_transformer_infer.py
│ ├── nmt_webapp
│ ├── preprocess_dataset.py
│ └── translate_ddp.py
├── question_answering
│ ├── conf
│ ├── get_squad.py
│ └── question_answering_squad.py
├── text2sparql
│ ├── conf
│ ├── data
│ ├── evaluate_text2sparql.py
│ └── text2sparql.py
├── text_classification
│ ├── conf
│ ├── data
│ ├── model_parallel_text_classification_evaluation.py
│ └── text_classification_with_bert.py
└── token_classification
├── conf
├── data
├── punctuation_capitalization_evaluate.py
├── punctuation_capitalization_train.py
├── token_classification_evaluate.py
└── token_classification_train.py
27 directories, 31 files
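You can see several models covering classic NLP tasks.
This post focuses on text classification; the next notebook, on named entity recognition (NER), centers on token classification.
👀 View code for a closer look at text classification
It's fine if you can't run this check.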
TC_DIR = "/dli/task/nemo/examples/nlp/text_classification"
!tree $TC_DIR
/dli/task/nemo/examples/nlp/text_classification
├── conf
│ └── text_classification_config.yaml
├── data
│ └── import_datasets.py
├── model_parallel_text_classification_evaluation.py
└── text_classification_with_bert.py
2 directories, 4 files
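Configuration file text_classification_config.yaml
- Specifies details for the model, training, and experiment management, such as file locations, the pretrained model, and hyperparameters
Python script text_classification_with_bert.py
- Encapsulates everything needed to run the text classification experiment defined in the configuration file
- Uses Facebook's Hydra tool for configuration management, so configuration values can be overridden with command-line options and the whole experiment can be run with the script.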
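The idea of overriding configuration values with dotted `key=value` arguments can be illustrated without Hydra itself. The sketch below is plain Python, not Hydra's implementation: the helper `apply_overrides` is hypothetical, and the keys `model.dataset.max_seq_length` and `trainer.max_epochs` are assumptions chosen for illustration that should be checked against the actual text_classification_config.yaml. Values are kept as strings here, whereas a real configuration system would also handle types.

```python
def apply_overrides(config, overrides):
    """Apply dotted 'a.b.c=value' overrides to a nested dict, in place."""
    for item in overrides:
        dotted_key, value = item.split("=", 1)   # split on the first '=' only
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:                      # walk (or create) nested sections
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


# Hypothetical defaults standing in for values loaded from a YAML file
config = {
    "model": {"dataset": {"max_seq_length": "256"}},
    "trainer": {"max_epochs": "100"},
}

# The same strings one would pass after the script name on the command line
apply_overrides(config, ["model.dataset.max_seq_length=128", "trainer.max_epochs=3"])
print(config)
```

Keeping the defaults in the YAML file and overriding only a few values per run makes it easy to see exactly what changed between experiments.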
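Text classification from the command line
[Goal] The question we want to answer in this lab is a three-class text classification problem
- (0) "cancer": Is the given medical disease abstract about cancer?
- (1) "neurological disorders": Or is it about a neurological disorder?
- (2) "other": Or is it about something else?
[Preparing the data]
We already did this above, but here it is again as a review
👀 View code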
TC3_DATA_DIR = '/dli/task/data/NCBI_tc-3'
!ls $TC3_DATA_DIR/*.tsv
Result
/dli/task/data/NCBI_tc-3/dev.tsv /dli/task/data/NCBI_tc-3/train.tsv
/dli/task/data/NCBI_tc-3/test.tsv
# Take a look at the tab separated data
print("*****\ntrain.tsv sample\n*****")
!head -n 3 $TC3_DATA_DIR/train.tsv
print("\n\n*****\ndev.tsv sample\n*****")
!head -n 3 $TC3_DATA_DIR/dev.tsv
print("\n\n*****\ntest.tsv sample\n*****")
!head -n 3 $TC3_DATA_DIR/test.tsv
Result
*****
train.tsv sample
*****
sentence label
Identification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor . The adenomatous polyposis coli ( APC ) tumour-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In colon carcinoma cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC homologues . Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin . Like APC , APC2 regulates the formation of active betacatenin-Tcf complexes , as demonstrated using transient transcriptional activation assays in APC - / - colon carcinoma cells . Human APC2 maps to chromosome 19p13 . 3 . APC and APC2 may therefore have comparable functions in development and cancer . 0
A common MSH2 mutation in English and North American HNPCC families: origin, phenotypic expression, and sex specific differences in colorectal cancer . The frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5 leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2 mutation so far reported . Although this mutation was initially detected in four of 33 colorectal cancer families analysed from eastern England , more extensive analysis has reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast , the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal families from Newfoundland . To investigate the origin of this mutation in colorectal cancer families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) , haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the English and US families there was little evidence for a recent common origin of the MSH2 splice site mutation in most families . In contrast , a common haplotype was identified at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families . These findings suggested a founder effect within Newfoundland similar to that reported by others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined , the penetrances at age 60 years for all cancers and for colorectal cancer were 0 . 86 and 0 . 57 , respectively . The risk of colorectal cancer was significantly higher ( p 0 . 01 ) in males than females ( 0 . 63 v 0 . 
30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) . For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks have implications for screening programmes and for attempts to identify colorectal cancer susceptibility modifiers . 0
*****
dev.tsv sample
*****
sentence label
BRCA1 is secreted and exhibits properties of a granin. Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . We now show that BRCA1 encodes a 190-kD protein with sequence homology and biochemical analogy to the granin protein family . Interestingly , BRCA2 also includes a motif similar to the granin consensus at the C terminus of the protein . Both BRCA1 and the granins localize to secretory vesicles , are secreted by a regulated pathway , are post-translationally glycosylated and are responsive to hormones . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products . . 0
Ovarian cancer risk in BRCA1 carriers is modified by the HRAS1 variable number of tandem repeat (VNTR) locus. Women who carry a mutation in the BRCA1 gene ( on chromosome 17q21 ) , have an 80 % risk of breast cancer and a 40 % risk of ovarian cancer by the age of 70 ( ref . 1 ) . The variable penetrance of BRCA1 suggests that other genetic and non-genetic factors play a role in tumourigenesis in these individuals . The HRAS1 variable number of tandem repeats ( VNTR ) polymorphism , located 1 kilobase ( kb ) downstream of the HRAS1 proto-oncogene ( chromosome 11p15 . 5 ) is one possible genetic modifier of cancer penetrance . Individuals who have rare alleles of the VNTR have an increased risk of certain types of cancers , including breast cancer ( 2-4 ) . To investigate whether the presence of rare HRAS1 alleles increases susceptibility to hereditary breast and ovarian cancer , we have typed a panel of 307 female BRCA1 carriers at this locus using a PCR-based technique . The risk for ovarian cancer was 2 . 11 times greater for BRCA1 carriers harbouring one or two rare HRAS1 alleles , compared to carriers with only common alleles ( P = 0 . 015 ) . The magnitude of the relative risk associated with a rare HRAS1 allele was not altered by adjusting for the other known risk factors for hereditary ovarian cancer ( 5 ) . Susceptibility to breast cancer did not appear to be affected by the presence of rare HRAS1 alleles . This study is the first to show the effect of a modifying gene on the penetrance of an inherited cancer syndrome 0
*****
test.tsv sample
*****
sentence label
Clustering of missense mutations in the ataxia-telangiectasia gene in a sporadic T-cell leukaemia. Ataxia-telangiectasia ( A-T ) is a recessive multi-system disorder caused by mutations in the ATM gene at 11q22-q23 ( ref . 3 ) . The risk of cancer , especially lymphoid neoplasias , is substantially elevated in A-T patients and has long been associated with chromosomal instability . By analysing tumour DNA from patients with sporadic T-cell prolymphocytic leukaemia ( T-PLL ) , a rare clonal malignancy with similarities to a mature T-cell leukaemia seen in A-T , we demonstrate a high frequency of ATM mutations in T-PLL . In marked contrast to the ATM mutation pattern in A-T , the most frequent nucleotide changes in this leukaemia were missense mutations . These clustered in the region corresponding to the kinase domain , which is highly conserved in ATM-related proteins in mouse , yeast and Drosophila . The resulting amino-acid substitutions are predicted to interfere with ATP binding or substrate recognition . Two of seventeen mutated T-PLL samples had a previously reported A-T allele . In contrast , no mutations were detected in the p53 gene , suggesting that this tumour suppressor is not frequently altered in this leukaemia . Occasional missense mutations in ATM were also found in tumour DNA from patients with B-cell non-Hodgkins lymphomas ( B-NHL ) and a B-NHL cell line . The evidence of a significant proportion of loss-of-function mutations and a complete absence of the normal copy of ATM in the majority of mutated tumours establishes somatic inactivation of this gene in the pathogenesis of sporadic T-PLL and suggests that ATM acts as a tumour suppressor . As constitutional DNA was not available , a putative hereditary predisposition to T-PLL will require further investigation . . 0
Myotonic dystrophy protein kinase is involved in the modulation of the Ca2+ homeostasis in skeletal muscle cells. Myotonic dystrophy ( DM ) , the most prevalent muscular disorder in adults , is caused by ( CTG ) n-repeat expansion in a gene encoding a protein kinase ( DM protein kinase ; DMPK ) and involves changes in cytoarchitecture and ion homeostasis . To obtain clues to the normal biological role of DMPK in cellular ion homeostasis , we have compared the resting [ Ca2 + ] i , the amplitude and shape of depolarization-induced Ca2 + transients , and the content of ATP-driven ion pumps in cultured skeletal muscle cells of wild-type and DMPK [ - / - ] knockout mice . In vitro-differentiated DMPK [ - / - ] myotubes exhibit a higher resting [ Ca2 + ] i than do wild-type myotubes because of an altered open probability of voltage-dependent l-type Ca2 + and Na + channels . The mutant myotubes exhibit smaller and slower Ca2 + responses upon triggering by acetylcholine or high external K + . In addition , we observed that these Ca2 + transients partially result from an influx of extracellular Ca2 + through the l-type Ca2 + channel . Neither the content nor the activity of Na + / K + ATPase and sarcoplasmic reticulum Ca2 + -ATPase are affected by DMPK absence . In conclusion , our data suggest that DMPK is involved in modulating the initial events of excitation-contraction coupling in skeletal muscle . . 1
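[Data preprocessing]
The preprocessed data is already in the format specified in the documentation, as shown below
(Header preprocessing)
- [WORD][SPACE][WORD][SPACE][WORD][TAB][LABEL]
- The header row contains "sentence label" and must be removed.
(Considering text length)
- Because the texts are quite long, the max_seq_length value must be taken into account during training.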
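One way to get a feel for a reasonable `max_seq_length` is to look at how long the samples are. The sketch below counts whitespace-separated words per row. This is only a rough lower bound, since BERT-style tokenizers split words into subword pieces and add special tokens, so the real token counts will be higher. The sample rows are single sentences taken from the abstracts above (real rows hold whole abstracts), and the helper name `word_counts` is an assumption for illustration.

```python
def word_counts(rows):
    """Return the number of whitespace-separated words in each 'text<TAB>label' row."""
    counts = []
    for row in rows:
        # rpartition splits on the last tab, so the label is dropped safely
        text, _sep, _label = row.rstrip("\n").rpartition("\t")
        counts.append(len(text.split()))
    return counts


sample_rows = [
    "Complex formation induces the rapid degradation of betacatenin .\t0\n",
    "Human APC2 maps to chromosome 19p13 .\t0\n",
    "Genotyping for APOE was performed blind to clinical information .\t1\n",
]

counts = word_counts(sample_rows)
longest = max(counts)
print(counts, longest)
```

Running the same count over the full training file shows how many samples would be truncated at a given `max_seq_length`.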
👀 View header-removal code
There are several ways to do this, but since it is a simple change we use the stream editor (sed) command from the shell
!sed 1d $TC3_DATA_DIR/train.tsv > $TC3_DATA_DIR/train_nemo_format.tsv
!sed 1d $TC3_DATA_DIR/dev.tsv > $TC3_DATA_DIR/dev_nemo_format.tsv
!sed 1d $TC3_DATA_DIR/test.tsv > $TC3_DATA_DIR/test_nemo_format.tsv
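For environments without `sed`, the same header removal can be sketched in plain Python. The helper `drop_header` is hypothetical, and the file contents below are placeholder rows written to a temporary directory so the example runs on its own; it does not touch the real dataset.

```python
import tempfile
from pathlib import Path


def drop_header(src, dst):
    """Copy src to dst, skipping the first line (same effect as `sed 1d`)."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        next(fin, None)        # discard the header row; no error on an empty file
        fout.writelines(fin)   # stream the remaining lines unchanged


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "train.tsv"
    dst = Path(tmp) / "train_nemo_format.tsv"
    # Placeholder content: a header row followed by two sample rows
    src.write_text("sentence\tlabel\nfirst sample .\t0\nsecond sample .\t1\n",
                   encoding="utf-8")
    drop_header(src, dst)
    result = dst.read_text(encoding="utf-8")

print(result)
```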
# Take a look at the tab separated data
# Labels: 0 = cancer, 1 = neurological disorders, 2 = other
print("*****\ntrain_nemo_format.tsv sample\n*****")
!head -n 3 $TC3_DATA_DIR/train_nemo_format.tsv
print("\n\n*****\ndev_nemo_format.tsv sample\n*****")
!head -n 3 $TC3_DATA_DIR/dev_nemo_format.tsv
print("\n\n*****\ntest_nemo_format.tsv sample\n*****")
!head -n 3 $TC3_DATA_DIR/test_nemo_format.tsv
TC3_DATA_DIR = '/dli/task/data/NCBI_tc-3'
!ls $TC3_DATA_DIR/*.tsv
Result
*****
train_nemo_format.tsv sample
*****
Identification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor . The adenomatous polyposis coli ( APC ) tumour-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In colon carcinoma cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC homologues . Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin . Like APC , APC2 regulates the formation of active betacatenin-Tcf complexes , as demonstrated using transient transcriptional activation assays in APC - / - colon carcinoma cells . Human APC2 maps to chromosome 19p13 . 3 . APC and APC2 may therefore have comparable functions in development and cancer . 0
A common MSH2 mutation in English and North American HNPCC families: origin, phenotypic expression, and sex specific differences in colorectal cancer . The frequency , origin , and phenotypic expression of a germline MSH2 gene mutation previously identified in seven kindreds with hereditary non-polyposis cancer syndrome (HNPCC) was investigated . The mutation ( A-- > T at nt943 + 3 ) disrupts the 3 splice site of exon 5 leading to the deletion of this exon from MSH2 mRNA and represents the only frequent MSH2 mutation so far reported . Although this mutation was initially detected in four of 33 colorectal cancer families analysed from eastern England , more extensive analysis has reduced the frequency to four of 52 ( 8 % ) English HNPCC kindreds analysed . In contrast , the MSH2 mutation was identified in 10 of 20 ( 50 % ) separately identified colorectal families from Newfoundland . To investigate the origin of this mutation in colorectal cancer families from England ( n = 4 ) , Newfoundland ( n = 10 ) , and the United States ( n = 3 ) , haplotype analysis using microsatellite markers linked to MSH2 was performed . Within the English and US families there was little evidence for a recent common origin of the MSH2 splice site mutation in most families . In contrast , a common haplotype was identified at the two flanking markers ( CA5 and D2S288 ) in eight of the Newfoundland families . These findings suggested a founder effect within Newfoundland similar to that reported by others for two MLH1 mutations in Finnish HNPCC families . We calculated age related risks of all , colorectal , endometrial , and ovarian cancers in nt943 + 3 A-- > T MSH2 mutation carriers ( n = 76 ) for all patients and for men and women separately . For both sexes combined , the penetrances at age 60 years for all cancers and for colorectal cancer were 0 . 86 and 0 . 57 , respectively . The risk of colorectal cancer was significantly higher ( p 0 . 01 ) in males than females ( 0 . 63 v 0 . 
30 and 0 . 84 v 0 . 44 at ages 50 and 60 years , respectively ) . For females there was a high risk of endometrial cancer ( 0 . 5 at age 60 years ) and premenopausal ovarian cancer ( 0 . 2 at 50 years ) . These intersex differences in colorectal cancer risks have implications for screening programmes and for attempts to identify colorectal cancer susceptibility modifiers . 0
Age of onset in Huntington disease : sex specific influence of apolipoprotein E genotype and normal CAG repeat length . Age of onset ( AO ) of Huntington disease ( HD) is known to be correlated with the length of an expanded CAG repeat in the HD gene . Apolipoprotein E ( APOE ) genotype , in turn , is known to influence AO in Alzheimer disease , rendering the APOE gene a likely candidate to affect AO in other neurological diseases too . We therefore determined APOE genotype and normal CAG repeat length in the HD gene for 138 HD patients who were previously analysed with respect to CAG repeat length . Genotyping for APOE was performed blind to clinical information . In addition to highlighting the effect of the normal repeat length upon AO in maternally inherited HD and in male patients , we show that the APOE epsilon2epsilon3 genotype is associated with significantly earlier AO in males than in females . Such a sex difference in AO was not apparent for any of the other APOE genotypes . Our findings suggest that subtle differences in the course of the neurodegeneration in HD may allow interacting genes to exert gender specific effects upon AO. 1
*****
dev_nemo_format.tsv sample
*****
BRCA1 is secreted and exhibits properties of a granin. Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . We now show that BRCA1 encodes a 190-kD protein with sequence homology and biochemical analogy to the granin protein family . Interestingly , BRCA2 also includes a motif similar to the granin consensus at the C terminus of the protein . Both BRCA1 and the granins localize to secretory vesicles , are secreted by a regulated pathway , are post-translationally glycosylated and are responsive to hormones . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products . . 0
Ovarian cancer risk in BRCA1 carriers is modified by the HRAS1 variable number of tandem repeat (VNTR) locus. Women who carry a mutation in the BRCA1 gene ( on chromosome 17q21 ) , have an 80 % risk of breast cancer and a 40 % risk of ovarian cancer by the age of 70 ( ref . 1 ) . The variable penetrance of BRCA1 suggests that other genetic and non-genetic factors play a role in tumourigenesis in these individuals . The HRAS1 variable number of tandem repeats ( VNTR ) polymorphism , located 1 kilobase ( kb ) downstream of the HRAS1 proto-oncogene ( chromosome 11p15 . 5 ) is one possible genetic modifier of cancer penetrance . Individuals who have rare alleles of the VNTR have an increased risk of certain types of cancers , including breast cancer ( 2-4 ) . To investigate whether the presence of rare HRAS1 alleles increases susceptibility to hereditary breast and ovarian cancer , we have typed a panel of 307 female BRCA1 carriers at this locus using a PCR-based technique . The risk for ovarian cancer was 2 . 11 times greater for BRCA1 carriers harbouring one or two rare HRAS1 alleles , compared to carriers with only common alleles ( P = 0 . 015 ) . The magnitude of the relative risk associated with a rare HRAS1 allele was not altered by adjusting for the other known risk factors for hereditary ovarian cancer ( 5 ) . Susceptibility to breast cancer did not appear to be affected by the presence of rare HRAS1 alleles . This study is the first to show the effect of a modifying gene on the penetrance of an inherited cancer syndrome 0
A novel homeodomain-encoding gene is associated with a large CpG island interrupted by the myotonic dystrophy unstable (CTG)n repeat. Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . Characterisation of the expression of this gene in patient tissues has thus far generated conflicting data on alterations in the steady state levels of DMPK mRNA , and on the final DMPK protein levels in the presence of the expansion . The DM region of chromosome 19 is gene rich , and it is possible that the repeat expansion may lead to dysfunction of a number of transcription units in the vicinity , perhaps as a consequence of chromatin disruption . We have searched for genes associated with a CpG island at the 3 end of DMPK . Sequencing of this region shows that the island extends over 3 . 5 kb and is interrupted by the ( CTG ) n repeat . Comparison of genomic sequences downstream ( centromeric ) of the repeat in human and mouse identified regions of significant homology . These correspond to exons of a gene predicted to encode a homeodomain protein . RT-PCR analysis shows that this gene , which we have called DM locus-associated homeodomain protein ( DMAHP ) , is expressed in a number of human tissues , including skeletal muscle , heart and brain . 1
*****
test_nemo_format.tsv sample
*****
Clustering of missense mutations in the ataxia-telangiectasia gene in a sporadic T-cell leukaemia. Ataxia-telangiectasia ( A-T ) is a recessive multi-system disorder caused by mutations in the ATM gene at 11q22-q23 ( ref . 3 ) . The risk of cancer , especially lymphoid neoplasias , is substantially elevated in A-T patients and has long been associated with chromosomal instability . By analysing tumour DNA from patients with sporadic T-cell prolymphocytic leukaemia ( T-PLL ) , a rare clonal malignancy with similarities to a mature T-cell leukaemia seen in A-T , we demonstrate a high frequency of ATM mutations in T-PLL . In marked contrast to the ATM mutation pattern in A-T , the most frequent nucleotide changes in this leukaemia were missense mutations . These clustered in the region corresponding to the kinase domain , which is highly conserved in ATM-related proteins in mouse , yeast and Drosophila . The resulting amino-acid substitutions are predicted to interfere with ATP binding or substrate recognition . Two of seventeen mutated T-PLL samples had a previously reported A-T allele . In contrast , no mutations were detected in the p53 gene , suggesting that this tumour suppressor is not frequently altered in this leukaemia . Occasional missense mutations in ATM were also found in tumour DNA from patients with B-cell non-Hodgkins lymphomas ( B-NHL ) and a B-NHL cell line . The evidence of a significant proportion of loss-of-function mutations and a complete absence of the normal copy of ATM in the majority of mutated tumours establishes somatic inactivation of this gene in the pathogenesis of sporadic T-PLL and suggests that ATM acts as a tumour suppressor . As constitutional DNA was not available , a putative hereditary predisposition to T-PLL will require further investigation . . 0
Myotonic dystrophy protein kinase is involved in the modulation of the Ca2+ homeostasis in skeletal muscle cells. Myotonic dystrophy ( DM ) , the most prevalent muscular disorder in adults , is caused by ( CTG ) n-repeat expansion in a gene encoding a protein kinase ( DM protein kinase ; DMPK ) and involves changes in cytoarchitecture and ion homeostasis . To obtain clues to the normal biological role of DMPK in cellular ion homeostasis , we have compared the resting [ Ca2 + ] i , the amplitude and shape of depolarization-induced Ca2 + transients , and the content of ATP-driven ion pumps in cultured skeletal muscle cells of wild-type and DMPK [ - / - ] knockout mice . In vitro-differentiated DMPK [ - / - ] myotubes exhibit a higher resting [ Ca2 + ] i than do wild-type myotubes because of an altered open probability of voltage-dependent l-type Ca2 + and Na + channels . The mutant myotubes exhibit smaller and slower Ca2 + responses upon triggering by acetylcholine or high external K + . In addition , we observed that these Ca2 + transients partially result from an influx of extracellular Ca2 + through the l-type Ca2 + channel . Neither the content nor the activity of Na + / K + ATPase and sarcoplasmic reticulum Ca2 + -ATPase are affected by DMPK absence . In conclusion , our data suggest that DMPK is involved in modulating the initial events of excitation-contraction coupling in skeletal muscle . . 1
Constitutional RB1-gene mutations in patients with isolated unilateral retinoblastoma. In most patients with isolated unilateral retinoblastoma , tumor development is initiated by somatic inactivation of both alleles of the RB1 gene . However , some of these patients can transmit retinoblastoma predisposition to their offspring . To determine the frequency and nature of constitutional RB1-gene mutations in patients with isolated unilateral retinoblastoma , we analyzed DNA from peripheral blood and from tumor tissue . The analysis of tumors from 54 ( 71 % ) of 76 informative patients showed loss of constitutional heterozygosity ( LOH ) at intragenic loci . Three of 13 uninformative patients had constitutional deletions . For 39 randomly selected tumors , SSCP , hetero-duplex analysis , sequencing , and Southern blot analysis were used to identify mutations . Mutations were detected in 21 ( 91 % ) of 23 tumors with LOH . In 6 ( 38 % ) of 16 tumors without LOH , one mutation was detected , and in 9 ( 56 % ) of the tumors without LOH , both mutations were found . Thus , a total of 45 mutations were identified in tumors of 36 patients . Thirty-nine of the mutations-including 34 small mutations , 2 large structural alterations , and hypermethylation in 3 tumors-were not detected in the corresponding peripheral blood DNA . In 6 ( 17 % ) of the 36 patients , a mutation was detected in constitutional DNA , and 1 of these mutations is known to be associated with reduced expressivity . The presence of a constitutional mutation was not associated with an early age at treatment . In 1 patient , somatic mosaicism was demonstrated by molecular analysis of DNA and RNA from peripheral blood . In 2 patients without a detectable mutation in peripheral blood , mosaicism was suggested because 1 of the patients showed multifocal tumors and the other later developed bilateral retinoblastoma . 
In conclusion , our results emphasize that the manifestation and transmissibility of retinoblastoma depend on the nature of the first mutation , its time in development , and the number and types of cells that are affected . . 2
TC3_DATA_DIR = '/dli/task/data/NCBI_tc-3'
!ls $TC3_DATA_DIR/*.tsv
Output
/dli/task/data/NCBI_tc-3/dev.tsv
/dli/task/data/NCBI_tc-3/dev_nemo_format.tsv
/dli/task/data/NCBI_tc-3/test.tsv
/dli/task/data/NCBI_tc-3/test_nemo_format.tsv
/dli/task/data/NCBI_tc-3/train.tsv
/dli/task/data/NCBI_tc-3/train_nemo_format.tsv
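Before pointing the config at these files, it can help to check how many examples each class has. The sketch below uses only the standard library and a tiny in-memory sample. It assumes each line is text, a tab, then an integer label, with no header row; that is an assumption about the *_nemo_format.tsv files, so confirm it by looking at the first few lines. The sample rows and the `count_labels` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a *_nemo_format.tsv file: text, a tab, then an integer label.
SAMPLE_TSV = (
    "mutations in the atm gene were found in t-pll\t0\n"
    "dmpk modulates calcium homeostasis in muscle cells\t1\n"
    "constitutional rb1 mutations in retinoblastoma\t2\n"
    "apc2 is a homologue of the apc tumour suppressor\t0\n"
)


def count_labels(tsv_file):
    """Count how many rows carry each label in a headerless text<TAB>label file."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The label is taken from the last column of every non-empty row.
    return Counter(int(row[-1]) for row in reader if row)


label_counts = count_labels(io.StringIO(SAMPLE_TSV))
print(label_counts)
```

To run it on a real file, pass `open(path, newline="")` in place of the `StringIO` object.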
Configuration File
Open the text_classification_config.yaml config file and review its keys and default values. In particular, note the three top-level keys, trainer, model, and exp_manager, and the key hierarchy under each.
trainer:
  gpus:
  num_nodes:
  max_epochs:
  …
model:
  nemo_path:
  tokenizer:
  language_model:
  classifier_head:
  …
exp_manager:
  …

CONFIG_DIR = "/dli/task/nemo/examples/nlp/text_classification/conf"
CONFIG_FILE = "text_classification_config.yaml"
!cat $CONFIG_DIR/$CONFIG_FILE

2.2.2.1 The OmegaConf Tool
The YAML config file provides defaults for most parameters, but a few must be specified before the text classification experiment can run.
Each YAML section is easier to inspect with the omegaconf package, which lets you access and manipulate configuration keys using dot notation.
Start by instantiating an OmegaConf object from the config file. The keys on that object can then be changed, added, viewed, and saved.
For example, to view only the model section, load the config file and print just config.model.
from omegaconf import OmegaConf
config = OmegaConf.load(CONFIG_DIR + "/" + CONFIG_FILE)
print(OmegaConf.to_yaml(config.model))

Details on the model arguments are available in the documentation. For this exercise we need to set dataset.num_classes and the file locations in train_ds.file_path, validation_ds.file_path, and test_ds.file_path.
To avoid running out of memory, we can cap dataset.max_seq_length at 128. The default infer_samples look like movie reviews, so we can replace them with sentences that are meaningful in the disease domain.
Other parameters may need to change later, but for now these are all the ones we must provide.
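Capping max_seq_length bounds the size of every input the model sees. As a rough illustration (not NeMo's actual preprocessing, which also has to keep special tokens in place), the sketch below truncates or pads a list of token IDs to a fixed length. The `fit_to_length` helper, the pad ID of 0, and the toy IDs are assumptions made for this example.

```python
def fit_to_length(token_ids, max_seq_length, pad_id=0):
    """Truncate or right-pad token_ids so the result has exactly max_seq_length items.

    Returns the fitted IDs and a mask with 1 for real tokens and 0 for padding.
    """
    kept = token_ids[:max_seq_length]
    padding = [pad_id] * (max_seq_length - len(kept))
    mask = [1] * len(kept) + [0] * len(padding)
    return kept + padding, mask


# A short sequence gets padded; a long one gets cut off.
short_ids, short_mask = fit_to_length([101, 7592, 102], max_seq_length=5)
long_ids, long_mask = fit_to_length(list(range(1, 11)), max_seq_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Anything past the cap is simply dropped, which is why a very small max_seq_length saves memory at the cost of discarding the end of long abstracts.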
Next, let's look at the trainer subsection.

print(OmegaConf.to_yaml(config.trainer))

We only have one GPU, so these settings are fine, although we may want to limit max_epochs to just a few for now. As with the model section, other parameters could be changed, but we will start with the defaults.
Finally, what about exp_manager?

print(OmegaConf.to_yaml(config.exp_manager))

This section is also fine as it is. If exp_dir is null, experiment results are placed by default in a new directory called nemo_experiments.
2.2.3 Hydra-Enabled Python Scripts
To summarize, the parameters we want to change or override are:
model.dataset.num_classes: 3
model.dataset.max_seq_length: 128
model.train_ds.file_path: train_nemo_format.tsv
model.validation_ds.file_path: dev_nemo_format.tsv
model.test_ds.file_path: test_nemo_format.tsv
model.infer_samples: relevant sentences
trainer.max_epochs: 3
With the text classification training script, we can run training, inference, and testing with a single command!
The script uses Hydra to manage the config file, so we can override the values we want directly on the command line, as follows.
TC_DIR = "/dli/task/nemo/examples/nlp/text_classification"

%%time
# The training takes about 2 minutes to run
TC_DIR = "/dli/task/nemo/examples/nlp/text_classification"

# set the values we want to override
NUM_CLASSES = 3
MAX_SEQ_LENGTH = 128
PATH_TO_TRAIN_FILE = "/dli/task/data/NCBI_tc-3/train_nemo_format.tsv"
PATH_TO_VAL_FILE = "/dli/task/data/NCBI_tc-3/dev_nemo_format.tsv"
PATH_TO_TEST_FILE = "/dli/task/data/NCBI_tc-3/test_nemo_format.tsv"
# disease domain inference sample answers should be 0, 1, 2
INFER_SAMPLES_0 = "In contrast no mutations were detected in the p53 gene suggesting that this tumour suppressor is not frequently altered in this leukaemia "
INFER_SAMPLES_1 = "The first predictive testing for Huntington disease was based on analysis of linked polymorphic DNA markers to estimate the likelihood of inheriting the mutation for HD"
INFER_SAMPLES_2 = "Further studies suggested that low dilutions of C5D serum contain a factor or factors interfering at some step in the hemolytic assay of C5 rather than a true C5 inhibitor or inactivator"
MAX_EPOCHS = 3

# Run the training script, overriding the config values in the command line
!python $TC_DIR/text_classification_with_bert.py \
    model.dataset.num_classes=$NUM_CLASSES \
    model.dataset.max_seq_length=$MAX_SEQ_LENGTH \
    model.train_ds.file_path=$PATH_TO_TRAIN_FILE \
    model.validation_ds.file_path=$PATH_TO_VAL_FILE \
    model.test_ds.file_path=$PATH_TO_TEST_FILE \
    model.infer_samples=["$INFER_SAMPLES_0","$INFER_SAMPLES_1","$INFER_SAMPLES_2"] \
    trainer.max_epochs=$MAX_EPOCHS

At the start of each training run, the experiment specification is printed to the log, including any parameters added or overridden on the command line. Additional information is also shown, such as the available GPUs, where logs are saved, and a few samples from the dataset along with their corresponding model inputs. The log also reports some statistics on the sequence lengths in the dataset.
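Each key=value argument in the command above addresses a nested config entry by its dotted path. The following is a simplified illustration of that idea using plain dictionaries. It is not how Hydra or OmegaConf are implemented (they also parse value types and validate keys), and the `apply_override` helper and the default values in `toy_config` are hypothetical.

```python
def apply_override(config, override):
    """Set a nested dict entry from a 'dotted.key=value' string (values stay strings)."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down one level per path segment, creating it if missing.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Toy defaults, made up for the example.
toy_config = {
    "trainer": {"max_epochs": "100"},
    "model": {"dataset": {"num_classes": "2", "max_seq_length": "256"}},
}
for arg in [
    "model.dataset.num_classes=3",
    "model.dataset.max_seq_length=128",
    "trainer.max_epochs=3",
]:
    apply_override(toy_config, arg)
print(toy_config)
```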
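After each epoch, a summary table of metrics on the validation set is printed, including precision, recall, and F1 score. Because the F1 score takes both false positives and false negatives into account, it is generally considered more useful than simple accuracy.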
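To make the relationship between these metrics concrete, here is a small sketch that derives precision, recall, and F1 for one class from two label lists. The `prf1` helper and the toy labels are illustrative only; NeMo computes its own report during validation.

```python
def prf1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels for a three-class problem (classes 0, 1, 2).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = prf1(y_true, y_pred, positive=1)
print(accuracy, precision, recall, f1)
```

In this toy case overall accuracy is 0.75, yet F1 for class 1 is only 0.5, because that class has one false positive and one false negative against a single correct hit.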
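When training finishes, NeMo saves the last checkpoint at the path specified in model.nemo_file_path. Because we left this at its default, it is written to the workspace in .nemo format.

!ls *.nemo

The results from this experiment may not have been very good, but it is easy to try another experiment with just a few changes. Training longer, adjusting the learning rate, and changing the batch size for the training and validation datasets can all improve results.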
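2.2.4 Exercise: Run an Experiment
Run another, similar experiment on the same text classification problem. This time, a few improvements are suggested:
- Enable mixed precision by setting amp_level to "O1" and precision to 16, so the model runs faster with little or no loss in accuracy.
- Raise the number of epochs slightly (a larger max_epochs) to improve results.
- Raise the learning rate slightly so the model responds more quickly to the estimated error each time the weights are updated.
The new values are provided for you in the cell below. Add the command with the appropriate overrides and run the cell. If you get stuck, refer to the solution.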
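For intuition about what 16-bit precision gives up, the sketch below round-trips a few values through IEEE 754 half precision using the standard library's struct module. It only shows the numeric format; it is not how NeMo or the trainer implements mixed precision, and `to_fp16` is a name chosen for this example.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# A typical activation-sized value, the learning rate from this exercise, and a larger number.
for value in (0.1, 5.0e-05, 1024.5):
    rounded = to_fp16(value)
    print(f"{value!r} -> {rounded!r} (abs error {abs(rounded - value):.3e})")
```

Half precision keeps roughly three significant decimal digits, which is usually enough for activations. Mixed precision keeps the more sensitive values in 32-bit floats.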
%%time
# The training takes about 2 minutes to run
TC_DIR = "/dli/task/nemo/examples/nlp/text_classification"

# set the values we want to override
NUM_CLASSES = 3
MAX_SEQ_LENGTH = 128
PATH_TO_TRAIN_FILE = "/dli/task/data/NCBI_tc-3/train_nemo_format.tsv"
PATH_TO_VAL_FILE = "/dli/task/data/NCBI_tc-3/dev_nemo_format.tsv"
PATH_TO_TEST_FILE = "/dli/task/data/NCBI_tc-3/test_nemo_format.tsv"
# disease domain inference sample answers should be 0, 1, 2
INFER_SAMPLES_0 = "In contrast no mutations were detected in the p53 gene suggesting that this tumour suppressor is not frequently altered in this leukaemia "
INFER_SAMPLES_1 = "The first predictive testing for Huntington disease was based on analysis of linked polymorphic DNA markers to estimate the likelihood of inheriting the mutation for HD"
INFER_SAMPLES_2 = "Further studies suggested that low dilutions of C5D serum contain a factor or factors interfering at some step in the hemolytic assay of C5 rather than a true C5 inhibitor or inactivator"
MAX_EPOCHS = 5
AMP_LEVEL = 'O1'
PRECISION = 16
LR = 5.0e-05

# Override the config values in the command line
# FIXME
How did this experiment compare with the previous one? Check the F1 scores and inference results in the output.
2.2.5 Visualizing Results with TensorBoard
The experiment manager saves results that can be viewed with TensorBoard. Run the next cell to create a TensorBoard link for your instance, then click the link to open TensorBoard in a browser tab.

%%js
const href = window.location.hostname + '/tensorboard/';
let a = document.createElement('a');
let link = document.createTextNode('Open Tensorboard!');
a.appendChild(link);
a.href = "http://" + href;
a.style.color = "navy";
a.target = "_blank";
element.append(a);

To compare the performance of the models you have run, select the "Train Loss" scalar. You can view all the models you ran together, or select individual models to compare. The example below shows the first experiment in orange and the second in blue; the second experiment reached a lower loss.
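[Image: TensorBoard Train Loss curves for the two experiments]
2.2.6 Exercise: Change the Language Model
So far we have used the default bert-base-uncased language model, but it is only one of the models you can try. Run the next cell to see the available language models.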
# complete list of supported BERT-like models
from nemo.collections import nlp as nemo_nlp
nemo_nlp.modules.get_pretrained_lm_models_list()

For this exercise, choose a new language model, such as megatron-bert-345m-cased.
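You may need to restart the notebook kernel to free memory. With large models, another way to save GPU memory is to reduce batch_size to 32, 16, or 8 and max_seq_length to 64.
There is no single right answer for this exercise; it is a chance for you to run several experiments. Some models can take several minutes to run, so feel free to move on to the next notebook and come back here later if you have time. If you get stuck, take a look at the example solution. Don't forget to record the loss and F1 results for this model, or to visualize the differences in TensorBoard.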
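As a back-of-the-envelope illustration of why a smaller batch_size and max_seq_length help, the sketch below estimates the size of just the self-attention score tensors, which grow with the square of the sequence length. The formula covers only that one activation. The 12 heads, 12 layers, and 4 bytes per value are assumptions matching a BERT-base-sized model in 32-bit floats; real memory use is higher and depends on the model and framework.

```python
def attention_scores_bytes(batch_size, seq_length, num_heads=12, num_layers=12, bytes_per_value=4):
    """Bytes needed to hold one seq_length x seq_length score matrix per head, layer, and example."""
    return batch_size * num_layers * num_heads * seq_length * seq_length * bytes_per_value


# Hypothetical "before" and "after" settings for comparison.
baseline = attention_scores_bytes(batch_size=32, seq_length=128)
reduced = attention_scores_bytes(batch_size=8, seq_length=64)
print(f"baseline: {baseline / 2**20:.0f} MiB")
print(f"reduced:  {reduced / 2**20:.0f} MiB")
print(f"ratio:    {baseline / reduced:.0f}x")
```

Halving the sequence length cuts this term by a factor of four, while cutting the batch size to a quarter cuts it by another factor of four.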
# Restart the kernel
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)