Chapter 8: Web Scraping Code

대학원 일기

대학원생(노예) 2023. 10. 16. 01:30

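# 8.3 Extracting the dataset list from the Public Data Portal (single page)
# Install the package if it is missing, then load it
if (!requireNamespace("rvest", quietly = TRUE)) install.packages("rvest")
library(rvest)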

# Read the web page
url <- "https://www.data.go.kr/tcs/dss/selectDataSetList.do"
html <- read_html(url)
html

# Extract the list item titles
title <- html_nodes(html, "#apiDataList .title") %>%
  html_text()
title

# Extract the list item descriptions
desc <- html_nodes(html, "#apiDataList .ellipsis") %>%
  html_text()
desc
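
The two selectors above can be tried offline before hitting the live site. The following is a minimal sketch that parses a small hand-written HTML fragment; the markup is an assumption for illustration that only mimics the `#apiDataList`, `.title`, and `.ellipsis` names, and the portal's real structure may differ.

```r
library(rvest)

# Hypothetical markup (not copied from the portal): two list items inside
# #apiDataList, plus one .title element outside it that should not match.
sample_html <- read_html('
  <div id="apiDataList">
    <div class="item">
      <span class="title">
        Dataset A
      </span>
      <p class="ellipsis">First description</p>
    </div>
    <div class="item">
      <span class="title">
        Dataset B
      </span>
      <p class="ellipsis">Second description</p>
    </div>
  </div>
  <div id="other"><span class="title">Not part of the list</span></div>
')

# "#apiDataList .title" selects .title elements that are descendants of #apiDataList
sample_title <- sample_html %>%
  html_nodes("#apiDataList .title") %>%
  html_text()

sample_desc <- sample_html %>%
  html_nodes("#apiDataList .ellipsis") %>%
  html_text()

length(sample_title)   # number of matched titles
trimws(sample_title)   # titles without the surrounding whitespace
sample_desc
```

The raw titles keep the line breaks and indentation from the markup, which is why a cleaning step is needed before building a data frame.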

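# Clean the data: remove control characters (CR, LF, tab) from the titles.
# Inside [...] a "|" is a literal pipe, not alternation, so it is left out of the class.
title <- gsub("[\r\n\t]", "", title)
title

# Print the result
api <- data.frame(title, desc)
api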
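
The cleaning and data frame steps can be wrapped in a small reusable helper. This is a sketch under stated assumptions: `clean_text` is a hypothetical name (not part of rvest), and the input vectors are made up for illustration. It replaces control characters with spaces, collapses repeated spaces, and trims both ends, and it checks that both columns have the same length before calling `data.frame()`.

```r
# Hypothetical helper, not part of rvest
clean_text <- function(x) {
  x <- gsub("[\r\n\t]", " ", x)   # each control character becomes one space
  x <- gsub(" +", " ", x)          # collapse runs of spaces
  trimws(x)                        # drop leading and trailing whitespace
}

# Made-up inputs that imitate scraped text
raw_title <- c("\n\t Dataset A \r\n", "\tDataset|B\n")
raw_desc  <- c("First description", "  Second\tdescription ")

# data.frame() needs columns of equal length, so fail early if a selector
# matched a different number of nodes
stopifnot(length(raw_title) == length(raw_desc))

sample_api <- data.frame(
  title = clean_text(raw_title),
  desc  = clean_text(raw_desc),
  stringsAsFactors = FALSE
)
sample_api
```

Because the character class lists only `\r`, `\n`, and `\t`, a pipe character in a title such as `Dataset|B` is preserved.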

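# 8.4 Extracting Naver movie reviews (single page)

# Install the package if it is missing, then load it
if (!requireNamespace("rvest", quietly = TRUE)) install.packages("rvest")
library(rvest)

# Read the web page
# The selectors below assume this page's layout at the time of writing;
# the page may have changed or moved since.
url <- "https://movie.naver.com/movie/point/af/list.nhn"
html <- read_html(url)
html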

# Extract the review cells
review_cell <- html_nodes(html, "#old_content table tr .title")
review_cell

# Extract the ratings
score <- html_nodes(review_cell, "em") %>%
  html_text()
score

# Extract the review text
review <- review_cell %>%
  html_text()
review
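
One thing to watch when pulling a nested value such as the rating: `html_nodes()` returns only the nodes it finds, so a cell without an `em` element silently shortens the result. The sketch below uses hypothetical markup (an assumption, not the real Naver page) to compare it with `html_node()`, which returns one result per input cell and yields `NA` where nothing matches.

```r
library(rvest)

# Hypothetical review table: the second cell has no <em> rating
sample_page <- read_html('
  <div id="old_content"><table>
    <tr><td class="title"><a>Movie A</a><em>10</em> Great film</td></tr>
    <tr><td class="title"><a>Movie B</a> No rating given</td></tr>
    <tr><td class="title"><a>Movie C</a><em>7</em> Not bad</td></tr>
  </table></div>
')

cells <- html_nodes(sample_page, "#old_content table tr .title")

# html_nodes(): all matches across the cells, so lengths can differ from cells
found_only <- cells %>% html_nodes("em") %>% html_text()

# html_node(): exactly one result per cell, NA when a cell has no <em>
per_cell <- cells %>% html_node("em") %>% html_text()
sample_score <- as.integer(per_cell)

length(cells)
length(found_only)
sample_score
```

Keeping one value per cell means the ratings stay aligned with the review text when both are later combined.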

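# Clean the review data
# (1) Position of the common text before each review ("별점" = star rating)
index.start <- regexpr("\\t별점 -", review)
index.start
# (2) Position of the common text after each review ("신고" = report)
index.end   <- regexpr("\\t신고", review)
index.end
# (3) Keep only the text between the two markers
review <- substring(review, index.start, index.end)
review
# Skip the first 15 characters (the fixed prefix that begins at the start marker)
review <- substring(review, 16)
review
# (4) Remove control characters (LF, tab). Inside [...] a "|" is a literal pipe,
#     so it is left out of the class.
review <- gsub("[\n\t]", "", review)
review
# (5) Trim leading and trailing whitespace
review <- trimws(review, "both")
review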
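
The marker-based extraction above can also be written without a hard-coded offset by using the `match.length` attribute that `regexpr()` returns. This is a sketch under stated assumptions: `extract_between` is a hypothetical helper, and the sample strings use made-up ASCII markers (`"\tText:"` and `"\tReport"`) in place of the page's real ones. It assumes the end marker appears after the start marker.

```r
# Hypothetical helper: return the text between two literal markers,
# or NA when either marker is missing.
extract_between <- function(x, start_marker, end_marker) {
  start <- regexpr(start_marker, x, fixed = TRUE)
  end   <- regexpr(end_marker,   x, fixed = TRUE)

  start_pos <- as.integer(start)
  end_pos   <- as.integer(end)
  # First character after the start marker
  from <- start_pos + attr(start, "match.length")

  found <- start_pos > 0 & end_pos > 0   # regexpr() returns -1 for no match
  out <- rep(NA_character_, length(x))
  out[found] <- substring(x[found], from[found], end_pos[found] - 1)

  # Replace control characters with spaces, then trim both ends
  trimws(gsub("[\r\n\t]", " ", out))
}

# Made-up cell text; the real page text will look different
sample_review <- c(
  "Movie A\n\tText: Great film \n\tReport",
  "Movie B\n\tText:\tNot bad\n\tReport",
  "no markers here"
)

extract_between(sample_review, "\tText:", "\tReport")
```

Deriving the offset from the matched marker keeps the extraction correct if the marker text changes length, and returning `NA` for unmatched cells makes missing reviews visible instead of producing empty strings.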
