Python Web Crawler? Building a Web Scraper - 2

Captain BIN
2020. 12. 8. 02:48

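Python Web Crawler

Learning requests

This section covers the minimum you need to know about the requests module to build a web crawler. As you probably know, there are two main ways to request data from a web server: GET and POST. Put simply, if the address bar of the browser shows a ? after the address followed by pairs of the form variable=value&variable=value, the data is being sent as a GET query string. With POST, the data travels in the request body instead, so it does not appear in the address bar. (A clean address alone does not prove that a page was requested with POST, since a GET request can also carry no parameters.)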
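To make the difference concrete, here is a minimal sketch that uses only the standard library. The address and the field names are placeholders I made up for illustration. It shows the same key/value pairs ending up in the address for GET and in the body for POST.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical form fields, used only for illustration.
params = {"key1": "value1", "key2": "value2"}
encoded = urlencode(params)

# GET: the encoded pairs are appended to the address after "?".
get_url = "https://www.example.com/search?" + encoded

# POST: the address stays clean and the same pairs travel in the request body.
post_url = "https://www.example.com/search"
post_body = encoded.encode("utf-8")

print(get_url)
print(parse_qs(urlsplit(get_url).query))
print(post_url, post_body)
```

Nothing is sent over the network here; the snippet only builds the strings a browser or requests would send.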

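The examples below show the basic form of GET and POST requests and how to send data with them. For the url part, try the address of a site you visit often. How to process the value of the req_dt variable once the data has arrived is described separately further down.

GET request - basic form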

import requests as req

url = "https://www.????.co.kr"

req_dt = req.get(url)

GET request - query string (1)

import requests as req

url = "https://www.????.co.kr"

req_dt = req.get(url, params={"key1":"value1","key2":"value2"})

GET request - query string (2)

import requests as req

url = "https://www.????.co.kr/?key1=value1&key2=value2"

req_dt = req.get(url)
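One practical reason to prefer params over a hand-written query string is that requests takes care of URL encoding. The sketch below prepares a request without sending it and prints the final address; the host and the q parameter are assumptions for illustration.

```python
import requests
from urllib.parse import urlsplit, parse_qs

# Hypothetical address; nothing is sent over the network here.
url = "https://www.example.com/search"

# prepare() builds the request as it would be sent, without sending it.
prepared = requests.Request(
    "GET", url, params={"key1": "value1", "q": "웹 크롤러"}
).prepare()

# The Korean text and the space are percent-encoded in the final URL.
print(prepared.url)

# Decoding the query string gives back the original values.
print(parse_qs(urlsplit(prepared.url).query))
```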

POST request - basic form

import requests as req

url = "https://www.????.co.kr"

req_dt = req.post(url)

POST request - sending data

import requests as req

url = "https://www.????.co.kr"

req_dt = req.post(url, data={"key1":"value1","key2":"value2"})

POST request - sending data (using json)

import requests as req
import json

url = "https://www.????.co.kr"

req_dt = req.post(url, data=json.dumps({"key1":"value1","key2":"value2"}))
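As a side note, requests also accepts a json argument that serializes the dict for you. The sketch below compares the two approaches by preparing the requests without sending them; the endpoint is a placeholder. The visible difference is the Content-Type header, which some servers require for JSON bodies.

```python
import json
import requests

url = "https://www.example.com/api"  # hypothetical endpoint
payload = {"key1": "value1", "key2": "value2"}

# data=json.dumps(...) sends the JSON text but sets no Content-Type header.
manual = requests.Request("POST", url, data=json.dumps(payload)).prepare()

# json=... serializes the dict and sets Content-Type: application/json.
auto = requests.Request("POST", url, json=payload).prepare()

print(manual.headers.get("Content-Type"))
print(auto.headers.get("Content-Type"))
print(json.loads(auto.body) == payload)
```

With the data form you can still add the header yourself through the headers argument.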

Setting headers

import requests as req

url = "https://www.????.co.kr"

req_dt = req.get(url, headers={"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"})
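When a crawler makes many requests, repeating the headers argument every time gets tedious. One option is a Session, which remembers headers (and cookies) across requests. This is a minimal sketch; the User-Agent value and the address are made up for illustration.

```python
import requests

session = requests.Session()

# Headers set on the session are sent with every request made through it.
session.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical value

# A real request would look like this:
# req_dt = session.get("https://www.example.com")

# Offline check: prepare a request through the session and inspect its headers.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.example.com")
)
print(prepared.headers["User-Agent"])
```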

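Processing the req_dt variable

Now let's see how to extract data once the response from a GET or POST request has been stored in the req_dt variable. There is nothing special here, so the output is omitted.

# Check the response (prints something like <Response [200]>)
print(req_dt)

# Print only the numeric status code
print(req_dt.status_code)

# Check the response headers
print(req_dt.headers)

# Cookie information - 1 (.get() returns None when the server sets no cookie)
print(req_dt.headers.get('Set-Cookie'))

# Cookie information - 2
print(req_dt.cookies)

# Get the HTML (as text)
# The text may be garbled if the encoding is detected incorrectly
print(req_dt.text)

# Get the HTML (as bytes)
print(req_dt.content)

# Check the encoding used to decode the page
print(req_dt.encoding)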
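The garbled text mentioned above usually comes from decoding the bytes with the wrong encoding. The sketch below reproduces the effect with plain Python; the sample text and the choice of CP949 (a superset of EUC-KR often seen on older Korean sites) are assumptions for illustration.

```python
# Bytes as a server might send them for a page saved in CP949.
raw = "파이썬 웹 크롤러".encode("cp949")

# Decoding with the wrong encoding does not raise an error,
# it silently produces garbled characters.
garbled = raw.decode("iso-8859-1")

# Decoding with the right encoding recovers the original text.
fixed = raw.decode("cp949")

print(garbled)
print(fixed)

# With requests, the equivalent fix is to set the encoding before reading .text:
# req_dt.encoding = "cp949"   # or req_dt.apparent_encoding
# print(req_dt.text)
```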

requests exceptions

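Here is a summary of the errors that can occur when using requests. Ideally no error occurs at all, but if one does, the table below explains what each error means.

Error	Description
RequestException	Base class for every error raised while handling a request
HTTPError	An HTTP error occurred (raised by raise_for_status() for 4xx/5xx responses)
ConnectionError	A connection error occurred
ProxyError	A proxy error occurred
SSLError	An SSL certificate error occurred
Timeout	The request timed out (parent class of ConnectTimeout and ReadTimeout, so catching it catches both)
ConnectTimeout	Timed out while trying to connect to the server
ReadTimeout	The server did not send any data within the allotted time
URLRequired	A valid URL is required to make the request
TooManyRedirects	Too many redirects (for example, a redirect loop)
MissingSchema	The URL scheme (http or https) is missing
InvalidSchema	The URL scheme is not supported (the docstring says to see defaults.py for valid schemes, but I could not find that file)
InvalidURL	The supplied URL is invalid
InvalidHeader	The supplied header is invalid
InvalidProxyURL	The supplied proxy URL is invalid
ChunkedEncodingError	The server declared chunked encoding but sent an invalid chunk
ContentDecodingError	The response content could not be decoded
StreamConsumedError	The content of the response has already been consumed
RetryError	Custom retry logic failed
UnrewindableBodyError	Rewinding the request body to read it again failed
RequestsWarning	Base warning for requests
FileModeWarning	A file was opened in text mode, but requests determined its binary length
RequestsDependencyWarning	An imported dependency is outside the expected version range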
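In practice you rarely catch every exception in the table one by one. A common pattern is to handle the cases you care about and let RequestException, the base class, catch the rest. The fetch() helper below is a hypothetical name of my own, not part of requests, and the 5-second timeout is an arbitrary choice.

```python
import requests
from requests.exceptions import ConnectTimeout, RequestException, Timeout


def fetch(url, timeout=5):
    """Return (text, None) on success or (None, error_name) on failure."""
    try:
        req_dt = requests.get(url, timeout=timeout)
        req_dt.raise_for_status()  # turn 4xx/5xx responses into HTTPError
        return req_dt.text, None
    except Timeout as err:
        # Covers both ConnectTimeout and ReadTimeout.
        return None, type(err).__name__
    except RequestException as err:
        # Base class: catches every other error raised by requests.
        return None, type(err).__name__


# The URL has no scheme, so requests raises MissingSchema
# before it ever touches the network.
print(fetch("www.example.com"))
```

Ordering matters here: the more specific Timeout clause has to come before RequestException, otherwise it would never be reached.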

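A taste of the urllib module

In the previous post I mentioned that the built-in urllib module plays a role similar to requests. We won't use it to build the Python web crawler, but urllib might feel left out if we skipped it entirely, so let's take a quick taste.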

from urllib.parse import urlencode
from urllib.request import urlopen, Request

url = "https://www.????.co.kr"

# Create a request object - basic form
req_hd = Request(url)

# Create a request object - for a POST request
data = {'key1':'value1','key2':'value2'}
data = urlencode(data)
data = data.encode('utf-8')
req_hd = Request(url, data=data, headers={})

# Create a request object - for a GET request
req_hd = Request(url+"?key1=value1&key2=value2", None, headers={})

# Send the request
req_dt = urlopen(req_hd)

# Process the data
print(req_dt)
print(req_dt.code)
print(req_dt.headers)
print(req_dt.url)
print(req_dt.info().get_content_charset())
print(req_dt.read())
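Unlike requests, urllib has no separate get() and post() functions. Whether a Request becomes a GET or a POST depends on whether data is supplied. The sketch below checks this offline with get_method(); the address is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://www.example.com"  # hypothetical address

# No data: urlopen() would send a GET.
get_req = Request(url + "?" + urlencode({"key1": "value1"}))

# With data (bytes): urlopen() would send a POST.
post_req = Request(url, data=urlencode({"key1": "value1"}).encode("utf-8"))

# get_method() reports which HTTP method would be used.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```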

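Learning BeautifulSoup - bs4

bs is short for BeautifulSoup, which translates literally as "beautiful soup". If I had to give the name a meaning, it is probably that, just as you scoop soup with a spoon, it tidies up messy HTML and picks out only the parts you need.

bs4 can work with three HTML parsers: lxml, html5lib, and html.parser. html.parser ships with Python, so it needs no extra installation. lxml and html5lib are separate packages: lxml is generally the fastest, while html5lib parses pages the way a web browser does but is slow. All three work on Python 3. This post uses the lxml parser throughout.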
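If you are not sure whether lxml is installed on the machine that runs the crawler, one option is to fall back to the built-in parser. The make_soup() helper below is a hypothetical name of my own; bs4 raises FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


spoon = make_soup(
    "<html><head><title>bs4 test</title></head><body><p>hello</p></body></html>"
)
print(spoon.title.string)
```

For well-formed HTML like this the parsers produce the same tree; they can differ in how they repair broken markup.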

Since the goal is to learn the usage, short HTML is used to keep the code and output from getting long.

The sample HTML used in the bs4 code is shown below. Just pay attention to the tag names and the structure.

<!DOCTYPE html>
<html lang="ko">
    <head>
        <style>
.class_div_1{background-color:rgb(121, 121, 121);}
#id_div_1{color:rgb(187, 68, 25);}
.class_div_2{background-color:rgb(253, 255, 128);}
#id_div_2{color:rgb(18, 199, 42);}
        </style>
        <title>bs4 test</title>
    </head>
    <body>
        <div class="class_div_1" id='id_div_1'>
            <p class='class_p_1_1' id='id_p_1_1'>div_1.p tag1</p>
            <p> Captain </p>
            <p class='class_p_1_2' id='id_p_1_2'>div_1.p tag2</p>
            <p> BIN </p>
            <p class='class_p_1_3' id='id_p_1_3'>div_1.p tag3</p>
            <p> BLOG </p>
        </div> 
        <div class="class_div_2" id='id_div_2'>
            <p class='class_p_2_1' id='id_p_2_1'>div_2.p tag1</p>
            <p>exam 2-1</p>
            <p class='class_p_2_2' id='id_p_2_2'>div_2.p tag2</p>
            <p>exam 2-2</p>
            <p class='class_p_2_3' id='id_p_2_3'>div_2.p tag3</p>
            <p>exam 2-3</p>
        </div>
    </body>
</html>

BeautifulSoup basics

The HTML in the exam_html variable below is the same as the HTML code just above. The code uses BeautifulSoup with the lxml parser and stores the tidied-up data in the spoon variable. Simple to use, isn't it? Now let's see how to work with the data held in spoon.

from bs4 import BeautifulSoup as bs

exam_html = "<!DOCTYPE html><html lang='ko'><head><style>.class_div_1{background-color:rgb(121, 121, 121);}#id_div_1{color:rgb(187, 68, 25);}.class_div_2{background-color:rgb(253, 255, 128);}#id_div_2{color:rgb(18, 199, 42);}</style><title>bs4 test</title></head><body><div class='class_div_1' id='id_div_1'><p class='class_p_1_1' id='id_p_1_1'>div_1.p tag1</p><p> Captain </p><p class='class_p_1_2' id='id_p_1_2'>div_1.p tag2</p><p> BIN </p><p class='class_p_1_3' id='id_p_1_3'>div_1.p tag3</p><p> BLOG </p></div> <div class='class_div_2' id='id_div_2'><p class='class_p_2_1' id='id_p_2_1'>div_2.p tag1</p><p>exam 2-1</p><p class='class_p_2_2' id='id_p_2_2'>div_2.p tag2</p><p>exam 2-2</p><p class='class_p_2_3' id='id_p_2_3'>div_2.p tag3</p><p>exam 2-3</p></div></body></html>"

spoon = bs(exam_html, 'lxml')

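Viewing the content parsed with the lxml parser

The output is long, so it is omitted. Printing spoon itself gives output identical to the HTML code. The result of calling prettify() on it is not quite to my taste, but it adds sensible indentation and line breaks, so the output is nicely formatted.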

print(spoon)
print("=" * 100)
print(spoon.prettify())

Checking the title - with the tag

print(spoon.title)

# Result
<title>bs4 test</title>

Checking the title - without the tag

The output looks the same, but the types differ.

print(spoon.title.text)
print(spoon.title.string)
print(type(spoon.title.text), "|" ,type(spoon.title.string))

# Result
bs4 test
bs4 test
<class 'str'> | <class 'bs4.element.NavigableString'>

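Extracting only the useful text

Calling text on the spoon variable extracts all of the text wrapped in tags. string, by contrast, returns the text only when the tag has exactly one child string (or one child tag that itself has a single string). The first string in the example is called on the whole document, which has many children, so it prints None. The second string is called on the p tag, so the text of the very first p tag in the HTML document is extracted.

print(spoon.text)
print("-" * 20)
print(spoon.string)
print("-" * 20)
print(spoon.p.string)

# Result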
bs4 testdiv_1.p tag1 Captain div_1.p tag2 BIN div_1.p tag3 BLOG  div_2.p tag1exam 2-1div_2.p tag2exam 2-2div_2.p tag3exam 2-3
--------------------
None
--------------------
div_1.p tag1
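As the result shows, text glues every string together, which is hard to read. A sketch of two alternatives that bs4 provides, get_text() with a separator and stripped_strings, follows. The shortened HTML and the use of html.parser (to avoid depending on lxml) are my own choices for this example.

```python
from bs4 import BeautifulSoup

html = "<div><p>div_1.p tag1</p><p> Captain </p><p>div_1.p tag2</p></div>"
spoon = BeautifulSoup(html, "html.parser")

# Plain .text joins the strings exactly as they appear.
print(repr(spoon.text))

# get_text() can strip each string and join them with a separator.
print(spoon.get_text(" | ", strip=True))

# stripped_strings yields the same cleaned strings one at a time.
print(list(spoon.stripped_strings))
```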

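Checking contents

This checks the contents that make up the div tag in the spoon variable. Even though there are several div tags, only the contents of the first one are returned. You can see that the return value is a list.

print(spoon.div.contents)

# Result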
[<p class="class_p_1_1" id="id_p_1_1">div_1.p tag1</p>, <p> Captain </p>, <p class="class_p_1_2" id="id_p_1_2">div_1.p tag2</p>, <p> BIN </p>, <p class="class_p_1_3" id="id_p_1_3">div_1.p tag3</p>, <p> BLOG </p>]

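Checking tag attributes

Using attrs on the div tag in the spoon variable's HTML returns its attributes as a dict. Individual attributes can be read with square brackets or get(). class is returned as a list because a tag can carry several classes.

print(spoon.div.attrs)
print(spoon.div['class'])
print(spoon.div['id'])
print(spoon.div.get('class'))
print(spoon.div.get('id'))

# Result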
{'class': ['class_div_1'], 'id': 'id_div_1'}
['class_div_1']
id_div_1
['class_div_1']
id_div_1
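The two access styles behave differently when an attribute is missing, which matters on real pages where not every tag has the attribute you expect. A minimal sketch follows; the HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div class='box wide' id='id_div_1'><a>no link</a></div>"
spoon = BeautifulSoup(html, "html.parser")

# class is multi-valued, so it comes back as a list.
print(spoon.div["class"])

# get() returns None (or a default you supply) for a missing attribute.
print(spoon.a.get("href"))
print(spoon.a.get("href", "#"))
print(spoon.a.has_attr("href"))

# Square brackets raise KeyError for a missing attribute.
try:
    spoon.a["href"]
except KeyError:
    print("KeyError: the a tag has no href attribute")
```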

Checking child tags

This checks the child tags that belong to the div tag. Note that children returns an iterator rather than a list, so a for loop is used to print the result.

print(spoon.div.children)

print("="*20)

for i in spoon.div.children:
    print(i)

# Result
<list_iterator object at 0x02DC0D90>
====================
<p class="class_p_1_1" id="id_p_1_1">div_1.p tag1</p>
<p> Captain </p>
<p class="class_p_1_2" id="id_p_1_2">div_1.p tag2</p>
<p> BIN </p>
<p class="class_p_1_3" id="id_p_1_3">div_1.p tag3</p>
<p> BLOG </p>
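The one-line exam_html contains no whitespace between tags, so every child is a p tag. Real pages are usually indented, and then the line breaks between tags show up as children too. The sketch below filters them out; the indented HTML is my own example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div>
    <p>div_1.p tag1</p>
    <p> Captain </p>
</div>"""
spoon = BeautifulSoup(html, "html.parser")

# children includes the whitespace strings between the tags.
all_children = list(spoon.div.children)

# Keep only real tags.
tag_children = [c for c in spoon.div.children if isinstance(c, Tag)]

# find_all(True, recursive=False) is another way to get direct child tags.
direct_tags = spoon.div.find_all(True, recursive=False)

print(len(all_children), len(tag_children), len(direct_tags))
```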

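Checking the parent tag and all ancestor tags

The output is long, so it is omitted. parent returns the parent of the p tag together with everything under it, so you see the parent, the tag itself, and its siblings. Note that parent only goes up to the immediate parent.

parents, on the other hand, walks up through the parent, grandparent, great-grandparent, and so on, all the way to the top-level tag.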

print(spoon.p.parent)
print("=" * 10)
print(spoon.p.parents)
print("=" * 10)
for i in spoon.p.parents:
    print(i)
    print("-" * 20)

Checking sibling tags

curr_tag = spoon.p
next_tag = curr_tag.next_sibling
prev_tag = next_tag.previous_sibling

print(curr_tag)
print(next_tag)
print(prev_tag)

# Result
<p class="class_p_1_1" id="id_p_1_1">div_1.p tag1</p>
<p> Captain </p>
<p class="class_p_1_1" id="id_p_1_1">div_1.p tag1</p>
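The same whitespace caveat applies to siblings: on indented HTML, next_sibling often returns the line break between two tags rather than the next tag. A sketch using find_next_sibling(), which skips over strings, follows; the HTML and the ids are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<div>
    <p id="first">div_1.p tag1</p>
    <p id="second"> Captain </p>
</div>"""
spoon = BeautifulSoup(html, "html.parser")

first = spoon.find(id="first")

# next_sibling is the whitespace between the two p tags.
print(repr(first.next_sibling))

# find_next_sibling() jumps to the next matching tag.
second = first.find_next_sibling("p")
print(second)

# find_previous_sibling() works the same way in the other direction.
print(second.find_previous_sibling("p"))
```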

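Checking the next and previous elements

This checks the elements that come after and before the specified tag. next_element returns only a single element, while next_elements yields every element up to the end of the document. There is a lot of output, so it is omitted.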

print(spoon.p.next_element)

for i in spoon.p.next_elements:
    print(i)
    
print(spoon.p.previous_element)

for i in spoon.p.previous_elements:
    print(i)

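That covers the basic usage of BeautifulSoup. While going through it, though, you probably felt that something was missing. You are unlikely to use these features often, but it is enough to know that they exist.

Now let's look at the bs4 functions that really are used a lot.

find_all() and find() functions

Check the results of the examples yourself. The examples only use id and class, but other attributes such as src or href can be used as well. Note the underscore when searching by class: class is a reserved word in Python, so an underscore is appended (class_) to avoid the clash.

find() is used the same way as find_all() but, as the name suggests, returns only a single element. There is nothing special about it, so it is left out of the examples below.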

# Print only the p tags
print(spoon.find_all('p'))

# Print the p tags whose class name is class_p_1_3 - 1
print(spoon.find_all('p', class_='class_p_1_3'))

# Print the p tags whose class name is class_p_1_3 - 2
print(spoon.find_all('p', 'class_p_1_3'))

# Print the div tag with the given id - 1
print(spoon.find_all('div', id='id_div_2'))

# Print the div tag with the given id - 2
print(spoon.find_all(id='id_div_2'))

# Print only 2 p tags
print(spoon.find_all('p', limit=2))

# Print the p tags whose text is ' Captain '
# The string must match exactly, including spaces (older bs4 versions call this argument text).
print(spoon.find_all('p', string=' Captain '))

# Search for several tag names at once, with a limit
print(spoon.find_all(['title', 'p'], limit=3))
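Although find() is left out above, one difference is worth a quick sketch: when nothing matches, find() returns None while find_all() returns an empty list, so code that chains attributes after find() should check for None first. The HTML below is a shortened variant of the sample, with img tags added by me to show searching on another attribute.

```python
from bs4 import BeautifulSoup

html = (
    "<div id='id_div_2'><p class='class_p_2_1'>div_2.p tag1</p>"
    "<p>exam 2-1</p><img src='a.png'><img src='b.png'></div>"
)
spoon = BeautifulSoup(html, "html.parser")

first_p = spoon.find("p")                 # first match only
missing = spoon.find("span")              # None when nothing matches
none_found = spoon.find_all("span")       # empty list when nothing matches

# Any attribute can be used as a keyword filter, not just id and class.
by_src = spoon.find_all("img", src="b.png")

# attrs={...} is an alternative that avoids the class_ underscore.
by_dict = spoon.find_all(attrs={"class": "class_p_2_1"})

print(first_p, missing, none_found)
print(by_src, by_dict)
```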

select() function

Tags can be extracted by specifying them the way you would with CSS selectors.

# Tags whose class name is class_div_2
print(spoon.select('.class_div_2'))

# class_p_2_3 tags nested under class_div_2
print(spoon.select('.class_div_2 .class_p_2_3'))

# Tag whose id is id_p_1_3
print(spoon.select('#id_p_1_3'))

# p tag whose id is id_p_1_1
print(spoon.select('p#id_p_1_1'))
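A few more selector forms tend to be handy when scraping. The sketch below uses select_one(), the child combinator, and an attribute selector on a shortened copy of the second div from the sample HTML.

```python
from bs4 import BeautifulSoup

html = (
    "<div class='class_div_2' id='id_div_2'>"
    "<p class='class_p_2_1' id='id_p_2_1'>div_2.p tag1</p>"
    "<p>exam 2-1</p>"
    "<p class='class_p_2_2' id='id_p_2_2'>div_2.p tag2</p>"
    "</div>"
)
spoon = BeautifulSoup(html, "html.parser")

# select_one() returns the first match instead of a list (None if no match).
first = spoon.select_one(".class_div_2 > p")
nothing = spoon.select_one("#no_such_id")

# Attribute selector: p tags whose id starts with "id_p_2".
with_id = spoon.select("p[id^='id_p_2']")

# select() returns a list, so it combines well with a list comprehension.
texts = [p.get_text() for p in spoon.select("div.class_div_2 p")]

print(first, nothing)
print(with_id)
print(texts)
```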

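extract() function

Unneeded tags can be removed from the collected HTML as in the example below, which picks out only the p tags and removes them. extract() also returns the removed tag, so it can be kept if needed.

print(spoon)
print("=" * 100)
for i in spoon.find_all('p'):
    i.extract()
print(spoon)

# Result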
<!DOCTYPE html>
<html lang="ko"><head><style>.class_div_1{background-color:rgb(121, 121, 121);}#id_div_1{color:rgb(187, 68, 25);}.class_div_2{background-color:rgb(253, 255, 128);}#id_div_2{color:rgb(18, 199, 42);}</style><title>bs4 test</title></head><body><div class="class_div_1" id="id_div_1"><p class="class_p_1_1" id="id_p_1_1">div_1.p tag1</p><p> Captain </p><p class="class_p_1_2" id="id_p_1_2">div_1.p tag2</p><p> BIN </p><p class="class_p_1_3" id="id_p_1_3">div_1.p tag3</p><p> BLOG </p></div> <div class="class_div_2" id="id_div_2"><p class="class_p_2_1" id="id_p_2_1">div_2.p tag1</p><p>exam 2-1</p><p class="class_p_2_2" id="id_p_2_2">div_2.p tag2</p><p>exam 2-2</p><p class="class_p_2_3" id="id_p_2_3">div_2.p tag3</p><p>exam 2-3</p></div></body></html>
====================================================================================================
<!DOCTYPE html>
<html lang="ko"><head><style>.class_div_1{background-color:rgb(121, 121, 121);}#id_div_1{color:rgb(187, 68, 25);}.class_div_2{background-color:rgb(253, 255, 128);}#id_div_2{color:rgb(18, 199, 42);}</style><title>bs4 test</title></head><body><div class="class_div_1" id="id_div_1"></div> <div class="class_div_2" id="id_div_2"></div></body></html>
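bs4 also has decompose(), which removes a tag without handing it back. A small sketch of the difference follows; the HTML and the ids keep and drop are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><p id='keep'>div_1.p tag1</p><p id='drop'> Captain </p></div>"
spoon = BeautifulSoup(html, "html.parser")

# extract() removes the tag from the tree and returns it for later use.
removed = spoon.find(id="keep").extract()

# decompose() removes the tag and destroys it; nothing is returned.
spoon.find(id="drop").decompose()

print(removed)
print(spoon)
```

Use extract() when you still need the removed piece, and decompose() when you simply want it gone.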

Continued in part 3 (crawling a lottery site)...

captainbin.tistory.com/entry/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%ED%81%AC%EB%A1%A4%EB%9F%AC-%EB%A7%8C%EB%93%A4%EA%B8%B0-3
