How to Use Gensim [Summary]

cdjiwon 2022. 9. 4. 17:16

How to use Gensim

Document: a piece of text

An object of the text sequence type (a news article, a book, etc.)
A single sentence is also a document

document = "Human machine interface for lab abc computer applications"

Corpus: a collection of documents

A collection of documents (item 1 above) forms a corpus.

text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

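Two roles of a corpus

First: it is the input for training a model.
During training, the model uses the corpus to look for common themes and topics and to initialize its internal parameters.
Second: it is the set of documents to organize.
After training, a topic model can be used to extract topics from new documents.
Such a corpus can also be indexed for similarity queries, queried by semantic similarity, clustered, and so on.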
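The dictionary snippet that follows uses a processed_corpus variable that this post never defines. Here is a minimal sketch of one way to build it, assuming a lowercase-and-split tokenizer, a small hand-picked stopword list, and removal of words that occur only once in the corpus. The stoplist and the intermediate names (texts, frequency) are assumptions for illustration, not something this post specifies.

```python
from collections import defaultdict

text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Assumed stopword list; adjust it for your own data.
stoplist = set("for a of the and to in".split())

# Lowercase each document, split on whitespace, and drop stopwords.
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in text_corpus
]

# Count how often each token occurs across the whole corpus.
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep only tokens that occur more than once in the corpus.
processed_corpus = [
    [token for token in text if frequency[token] > 1] for text in texts
]

for tokens in processed_corpus:
    print(tokens)
```

Under these assumptions 12 distinct tokens survive, which agrees with the dictionary size printed in the next snippet.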

from gensim import corpora

# processed_corpus: the tokenized, preprocessed documents (one list of tokens per document)
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

# Output
Dictionary(12 unique tokens : ['computer', 'human', ...] ...)

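This toy corpus yields only 12 tokens; dictionaries built from real-world corpora commonly contain hundreds of thousands of tokens.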

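Vector: a mathematically convenient representation of a document

To infer the latent structure of a corpus, we need a way to represent documents that can be manipulated mathematically.
Method 1: represent each document as a vector of features.
- e.g. think of a single feature as a question-answer pair.
- Output: (1, 0.0), (2, 2.0), (3, 5.0) [question number, answer]
Method 2: use the bag-of-words model
- Key property: it completely ignores the order of the tokens in the encoded document.
- e.g. each document is represented by a vector holding the frequency of every word in the dictionary
- It is a numerical representation of text that looks only at how often each word occurs.

Example result: [(0, 1), (1, 1)] (each pair is a token ID and its count)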

OUTPUT of step 3 (vectorizing the whole corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
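Each inner list above is one document, written as (token ID, count) pairs. In Gensim this conversion is done with dictionary.doc2bow, the same call the tf-idf example in this post uses. The sketch below mimics the idea in plain Python so the mechanics are visible. The helper names build_token2id and doc2bow_sketch are hypothetical, and the ID assignment (unseen tokens numbered in alphabetical order, one document at a time) is an assumption chosen because it reproduces the IDs shown above.

```python
from collections import Counter

# Tokenized corpus, assumed to be the result of stopword and rare-word removal.
processed_corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]


def build_token2id(corpus):
    """Map each distinct token to an integer ID (hypothetical helper)."""
    token2id = {}
    for tokens in corpus:
        # Unseen tokens of a document get the next free IDs in sorted order.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token for token in tokens if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


token2id = build_token2id(processed_corpus)
bow_corpus = [doc2bow_sketch(tokens, token2id) for tokens in processed_corpus]

for bow in bow_corpus:
    print(bow)
```

Note that word order is lost: the fourth document contains "system" twice, and all that survives is the pair (5, 2).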

Difference between a document and a vector

Document: the text itself
Vector: a mathematically convenient representation of that text

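Model: an algorithm that transforms vectors from one representation to another

Now that the corpus has been vectorized, a model can be used to transform it.
tf-idf is a numerical statistic that reflects how important a word is to a document.
The tf-idf model transforms vectors from the bag-of-words representation into a vector space in which the frequency counts are weighted by the relative rarity of each word in the corpus.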
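The weighting can be sketched by hand. The code below assumes that the default scheme of TfidfModel is the raw term count multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the word, followed by scaling the vector to unit length. The function name tfidf_sketch is hypothetical, and the bag-of-words corpus is copied from the OUTPUT listed earlier in this post.

```python
import math

# Bag-of-words corpus: one list of (token_id, count) pairs per document.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

num_docs = len(bow_corpus)

# Document frequency: in how many documents does each token ID occur?
doc_freq = {}
for bow in bow_corpus:
    for token_id, _ in bow:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

# Inverse document frequency: rarer tokens get larger values.
idf = {token_id: math.log2(num_docs / df) for token_id, df in doc_freq.items()}


def tfidf_sketch(bow):
    """Weight each count by idf, then scale the vector to unit length."""
    weights = [(t, count * idf[t]) for t, count in bow if t in idf]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(t, w / norm) for t, w in weights]


# "system minors" as a bag-of-words vector: system -> 5, minors -> 11.
print(tfidf_sketch([(5, 1), (11, 1)]))
```

For this query the sketch gives roughly 0.590 for "system" and 0.808 for "minors". "minors" occurs in two documents and "system" in three, so the rarer word receives the larger weight.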

tf-idf model example

from gensim import models

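# convert each tokenized document into a bag-of-words vector
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

# train the model
tfidf = models.TfidfModel(bow_corpus)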

# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

# OUTPUT
[(5, 0.5898341626740045), (11, 0.8075244024440723)]

Document similarity queries

# transform the whole corpus with tf-idf and build a similarity index
from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

# query the similarity of one document against every document in the corpus
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))
# OUTPUT
[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
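Each score above is a cosine similarity between the tf-idf vector of the query and the tf-idf vector of one document. Below is a plain-Python sketch of that computation, assuming the weighting is count times log2(N / df) with unit-length normalization. The names tfidf_sketch and cosine are hypothetical helpers, and a plain list stands in for the Gensim index.

```python
import math

# Bag-of-words corpus: one list of (token_id, count) pairs per document.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

num_docs = len(bow_corpus)
doc_freq = {}
for bow in bow_corpus:
    for token_id, _ in bow:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_sketch(bow):
    """Weight counts by log2(N / df) and scale the vector to unit length."""
    weights = {
        token_id: count * math.log2(num_docs / doc_freq[token_id])
        for token_id, count in bow
        if token_id in doc_freq
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(vec_a, vec_b):
    """Dot product of two unit-length sparse vectors stored as dicts."""
    return sum(w * vec_b.get(token_id, 0.0) for token_id, w in vec_a.items())


# Stand-in for the similarity index: the tf-idf vector of every document.
index = [tfidf_sketch(bow) for bow in bow_corpus]

# "system engineering": only "system" (ID 5) is in the dictionary,
# so "engineering" contributes nothing to the query vector.
query = tfidf_sketch([(5, 1)])
sims = [cosine(query, doc_vector) for doc_vector in index]
print(list(enumerate(sims)))
```

Because the query reduces to the single word "system", each score is simply the normalized tf-idf weight of "system" in that document. The fourth document scores highest: it contains "system" twice and only two other words.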

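Gensim workflow

Load or create the documents
Collect them into a corpus
Preprocess the corpus
Use gensim.corpora.Dictionary to map each word in the corpus to a unique integer ID, then convert each document into a bag-of-words vector with doc2bow
Build a model
Use the model to compute the similarity between a query document and every document in the corpus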

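Saving the dictionary

from gensim import corpora

# texts: the tokenized documents (a list of token lists, like processed_corpus above)
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary)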
