Python Web Scraping with Nomad Coders [9]

pavk96 2020. 3. 9. 21:45


3. Finding the elements

The code written previously

import requests
from bs4 import BeautifulSoup

# URL is the Indeed search URL (defined in the full listing at the end)
result = requests.get(URL)
s = BeautifulSoup(result.text, "html.parser")

pagination = s.find("div", {"class": "pagination"})
links = pagination.find_all('a')

pages = []
for link in links[:-1]:  # drop the last link, which is not a page number
    pages.append(int(link.string))

max_page = pages[-1]

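Nico (the instructor) says it is better to split everything except the imports into functions.

The reason: it keeps the code tidy.

main.py is the main file, the one that runs first.

Nico created a separate file, indeed.py, and wrote the code there.

It felt like sorting clothes by type into the compartments of a big wardrobe.

So I made a function called get_last_pages and moved the code above into it.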

def get_last_pages():
    result = requests.get(URL)
    s = BeautifulSoup(result.text, "html.parser")
    
    pagination = s.find("div", {"class": "pagination"})
    links = pagination.find_all('a')
  
    pages = []
    for link in links[:-1]:
        pages.append(int(link.string))
    
    max_page = pages[-1]
    return max_page

As mentioned before, most functions need a value to return.

What we are after is the last page number, so max_page is the return value.
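
To see the return value at work without hitting the network, here is a minimal sketch of the same page-number logic as a plain function. It assumes the link texts have already been pulled out of the pagination block as strings; last_page_from_labels and the sample labels are made up for illustration.

```python
def last_page_from_labels(labels):
    """Return the last page number from pagination link texts.

    Assumes the final label is the non-numeric "next" link, as in the
    scraper above, so it is dropped before converting to int.
    """
    pages = [int(label) for label in labels[:-1]]
    return pages[-1]


# Hypothetical link texts; the last one stands in for the "next" button.
sample_labels = ["2", "3", "4", "5", "Next"]
max_page = last_page_from_labels(sample_labels)
print(max_page)
```

Because the function returns the number instead of only printing it, the caller can store it in a variable and reuse it later.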

In the same way, we find the tags that hold the title, company, location, and link.

def extract_job(html):
    title = html.find("div", {"class": "title"}).find("a")["title"]
    company = html.find("span", {"class": "company"})
    company_anchor = company.find("a")
    if company_anchor is not None:
        company = str(company_anchor.string)
    else:
        company = str(company.string)
    company = company.strip()
    location = html.find("div", {"class": "recJobLoc"})["data-rc-loc"]
    job_id = html["data-jk"]
    return {
        "title": title,
        "company": company,
        "location": location,
        "link": f"https://kr.indeed.com/%EC%B1%84%EC%9A%A9%EB%B3%B4%EA%B8%B0?&jk={job_id}"
    }

The elements gathered by the extract_job function

title = html.find("div", {"class": "title"}).find("a")["title"]

When I tried taking the a tag's string for the title, some entries did not come back as a string.

So I used the "title" attribute on the a tag instead.

Attribute values are read with [ ].
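
A small sketch of why the attribute is the safer choice. The HTML fragment is invented for illustration, not copied from Indeed; it assumes the a tag contains nested markup, which is one situation where .string gives None.

```python
from bs4 import BeautifulSoup

# Made-up job card: the link text is split by a <b> tag.
html = '<div class="title"><a title="Python Developer"><b>Python</b> Developer</a></div>'
soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("div", {"class": "title"}).find("a")

# .string is None when a tag has more than one child node.
text_from_string = anchor.string

# Square brackets read an attribute, like a dictionary lookup.
text_from_attr = anchor["title"]

print(text_from_string)
print(text_from_attr)
```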

company = html.find("span", {"class": "company"})
company_anchor = company.find("a")
if company_anchor is not None:
    company = str(company_anchor.string)
else:
    company = str(company.string)
company = company.strip()

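For the company name I also took the a tag's string, but some companies had no link at all.

Printing those gave None.

And here the a tag's attributes did not contain the company name either.

So I used an if/else: where a link exists, take the a tag's string; where it does not, take the span tag's string.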
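
The fallback can be tried on its own with two invented cards, one with a company link and one without. The markup below is an assumption about the general shape of a card, and company_name is a hypothetical helper, not part of the course code.

```python
from bs4 import BeautifulSoup


def company_name(card):
    """Return the company name from a job card, linked or not."""
    company = card.find("span", {"class": "company"})
    anchor = company.find("a")
    # Prefer the link text; fall back to the span's own text.
    source = anchor if anchor is not None else company
    return str(source.string).strip()


# Two invented cards: one company has a link, the other does not.
linked = BeautifulSoup(
    '<div><span class="company">\n <a href="#">Acme</a>\n</span></div>',
    "html.parser",
)
plain = BeautifulSoup(
    '<div><span class="company">\n Globex\n</span></div>',
    "html.parser",
)

print(company_name(linked))
print(company_name(plain))
```

The strip() call matters because the text can come back with the surrounding newlines and spaces still attached.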

location = html.find("div", {"class": "recJobLoc"})["data-rc-loc"]

For the location, as with the title, I read the string from an attribute.

job_id = html["data-jk"]

If you click through to a posting, the page URL ends with jk=something.

That "something" is what tells the postings apart, and it was sitting in the data-jk attribute.
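
A rough sketch of the round trip between a data-jk value and the posting URL, using only the standard library. build_link, job_id_from_link, and the id "abc123" are placeholders made up for this example.

```python
from urllib.parse import parse_qs, urlparse

BASE = "https://kr.indeed.com/%EC%B1%84%EC%9A%A9%EB%B3%B4%EA%B8%B0"


def build_link(job_id):
    """Build a posting URL from the value of a card's data-jk attribute."""
    return f"{BASE}?&jk={job_id}"


def job_id_from_link(link):
    """Read the jk query parameter back out of a posting URL."""
    return parse_qs(urlparse(link).query)["jk"][0]


link = build_link("abc123")  # "abc123" is a placeholder id
print(link)
print(job_id_from_link(link))
```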

return {
    "title": title,
    "company": company,
    "location": location,
    "link": f"https://kr.indeed.com/%EC%B1%84%EC%9A%A9%EB%B3%B4%EA%B8%B0?&jk={job_id}"
}

Finally, I returned the values as a dictionary and put it all inside the extract_job function.

End of my turn.

The finished code

import requests
from bs4 import BeautifulSoup

Limit = 50
URL = f'https://kr.indeed.com/%EC%B7%A8%EC%97%85?as_and=Python&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&as_src=&radius=25&fromage=any&limit={Limit}&sort=&psf=advsrch&from=advancedsearch'

def get_last_pages():
    result = requests.get(URL)
    s = BeautifulSoup(result.text, "html.parser")
    pagination = s.find("div", {"class": "pagination"})
    links = pagination.find_all('a')
    print(links)
    pages = []
    for link in links[:-1]:
        pages.append(int(link.string))
    print(pages)
    max_page = pages[-1]
    return max_page

def extract_job(html):
    title = html.find("div", {"class": "title"}).find("a")["title"]
    company = html.find("span", {"class": "company"})
    company_anchor = company.find("a")
    if company_anchor is not None:
        company = str(company_anchor.string)
    else:
        company = str(company.string)
    company = company.strip()
    location = html.find("div", {"class": "recJobLoc"})["data-rc-loc"]
    job_id = html["data-jk"]
    return {
        "title": title,
        "company": company,
        "location": location,
        "link": f"https://kr.indeed.com/%EC%B1%84%EC%9A%A9%EB%B3%B4%EA%B8%B0?&jk={job_id}"
    }

Limit = 50 caps the number of job cards per page in the URL at 50.

Change it to 25 and you get 25.
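
One thing to watch: the f-string is evaluated once, when URL is defined, so reassigning Limit afterwards does not change the URL. A sketch of one way around that, using a hypothetical build_url helper and a shortened query string. Only as_and and limit are kept here, and whether Indeed accepts this reduced form is an assumption, not something verified.

```python
from urllib.parse import urlencode

BASE = "https://kr.indeed.com/%EC%B7%A8%EC%97%85"


def build_url(limit=50, query="Python"):
    """Return a search URL asking for `limit` job cards per page."""
    params = {"as_and": query, "limit": limit}
    return f"{BASE}?{urlencode(params)}"


print(build_url())    # ends with as_and=Python&limit=50
print(build_url(25))  # ends with as_and=Python&limit=25
```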
