
Web Crawling (Scraping) Overview + DOM Basics

metamong 2022. 4. 5.

๐Ÿ•ต๐Ÿป ์šฐ๋ฆฌ๊ฐ€ ์ฃผ์–ด์ง„ data๋Š” ๋Œ€๊ฒŒ ์ •๋ˆ๋œ structured data๊ฐ€ ์•„๋‹ ํ™•๋ฅ ์ด ๋†’๋‹ค. ํŠนํžˆ web์— ํฉ๋ฟŒ๋ ค์ง„ data๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ๊ฒฝ์šฐ๊ฐ€ ์ข…์ข… ์žˆ๋Š”๋ฐ, ์ด ๋•Œ web scraping์ด๋ผ๋Š” ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•จ! (๊ฐœ์ธ์ ์œผ๋กœ ๋„ˆ๋ฌด ์žฌ๋ฐŒ๋Š” web scraping ๐Ÿ„๐Ÿป‍โ™€๏ธ)

 

Q. Web Scraping vs. Web Crawling?

 

≫ Web scraping focuses on the 'data' we want to find / web crawling focuses on the 'URL', the place where we will look. The two are usually done together. (Web scraping is about extracting the data from one or more websites, while crawling is about finding or discovering URLs or links on the web.)

 

≫ Crawling finds the URLs, and scraping pulls the data out of the pages behind those URLs! (You need to combine crawling and scraping. So you first crawl, or discover, the URLs, download the HTML files, and then scrape the data from those files. This means you extract data and do something with it, like storing it in a database or further processing it.)
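
The crawl-then-scrape flow above can be sketched roughly like this. It's only a minimal sketch: the `PAGES` dictionary is a made-up in-memory "website" that stands in for real HTTP downloads, and `PageParser` and `crawl_and_scrape` are hypothetical names I picked for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" (URL -> HTML); stands in for real downloads.
PAGES = {
    "https://example.com/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<h1>Page A</h1><a href="/b">B</a>',
    "https://example.com/b": '<h1>Page B</h1><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects link targets (for crawling) and <h1> text (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


def crawl_and_scrape(start_url):
    """Crawl: follow links to discover URLs. Scrape: pull the <h1> from each page."""
    to_visit, seen, scraped = [start_url], set(), {}
    while to_visit:
        url = to_visit.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = PageParser()
        parser.feed(PAGES[url])           # "download" and parse the HTML
        scraped[url] = parser.headings    # scraping step: extract the data
        for href in parser.links:         # crawling step: discover new URLs
            to_visit.append(urljoin(url, href))
    return scraped


result = crawl_and_scrape("https://example.com/")
for url, headings in result.items():
    print(url, headings)
```

In a real project the dictionary lookup would be replaced by an actual download, and the scraped values would go into a file or database instead of being printed.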

 

≫ Crawling examples

1> search engines crawl the web so they can index pages and display them in the search results
→ search engine์„ ํ†ตํ•ด ์ฐพ๊ณ ์ž ํ•˜๋Š” page ์œ„์น˜๋ฅผ ์•Œ์•„๋ƒ„

 

2> when you have one website that you want to extract data from - in this case you know the domain - but you don't have the page URLs of that specific website. So you don't know what pages to scrape. So first you create a crawler that will output all the page URLs that you care about - it can be pages in a specific category on the site or in specific parts of the website. Or maybe the URL needs to contain some kind of word for example and you collect all those URLs - and then you create a scraper that extracts predefined data fields from those pages

→ ๋„๋ฉ”์ธ์€ ์•Œ์ง€๋งŒ ์–ด๋–ค page๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ง€ ์•„์ง ์ •ํ•˜์ง€ ์•Š์•˜์„ ๋•Œ crawler๋ฅผ ํ†ตํ•ด ๋ชจ๋“  page ๊ฐ๊ฐ์— ํ•ด๋‹นํ•˜๋Š” url์„ ์ถœ๋ ฅํ•œ๋‹ค.


!-- So, let's get to know the web, the place we will be pulling data from --!

1. ์›น์‚ฌ์ดํŠธ ๊ตฌ์กฐ

๐Ÿ•ต๐Ÿป ์›น ์‚ฌ์ดํŠธ์˜ ๋ผˆ๋Œ€๋ฅผ ์™„์„ฑํ•˜๋Š” ์–ธ์–ด๋Š” HTML์ด๋‹ค. ํ”ํžˆ ๋“ค์–ด๋ณธ CSS๋‚˜ JS๋Š” ๊พธ๋ฉฐ์ฃผ๋Š” ์šฉ๋„๋กœ, ๋ถ€๊ฐ€์ ์ธ ๊ธฐ๋Šฅ์— ์“ฐ์ž„.
(CSS๋Š” ๊พธ๋ฏธ๊ณ , JS๋Š” ์‹คํ–‰์‹œํ‚ค๊ณ !)
(ํ‹ฐ์Šคํ† ๋ฆฌ ์Šคํ‚จํŽธ์ง‘๋„ ๋ชจ๋‘ HTML ์–ธ์–ด๋ฅผ ์ˆ˜์ •ํ•˜๋Š” ๊ฒƒ!)

 

→ ์ผ๋‹จ HTML์€ Hyper Text Markup Language์˜ ์ค„์ž„๋ง๋กœ, ์›น ๋ธŒ๋ผ์šฐ์ €์— ๋ณด์—ฌ์ง€๋Š” ๋ฌธ์„œ๋ฅผ ์œ„ํ•œ ํ‘œ์ค€ ๋งˆํฌ์—… ์–ธ์–ด์ด๋‹ค.

(Here, a markup language means a structured language written with tags, as opposed to plain free-form text)
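
To see what "structured by tags" means in practice, here is a small sketch that prints the nesting of a tagged snippet. `TagOutline` is a hypothetical helper name, and the sketch assumes every opening tag in the input has a matching closing tag.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Records each tag and text chunk together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1                     # children are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1                     # back up to the parent level

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


outline = TagOutline()
outline.feed("<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>")
print("\n".join(outline.lines))
```

The same text without tags would just be "Title Some bold text"; the tags are what tell a program which part is the heading and which part is emphasized.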

 

(For the related HTML5 and CSS3 concepts, see the post linked below ↓↓)

 

HTML5 & CSS3 You Need for Crawling (Quick Summary)


sh-avid-learner.tistory.com

 

→ According to the official Mozilla docs, websites share the common components below!

 

① header: Usually a big strip across the top with a big heading, logo, and perhaps a tagline. This usually stays the same from one webpage to another. (The box that typically holds the site's big title)

 

② navigation bar: Links to the site's main sections; usually represented by menu buttons, links, or tabs. Like the header, this content usually remains consistent from one webpage to another; having inconsistent navigation on your website will just lead to confused, frustrated users. Many web designers consider the navigation bar to be part of the header rather than an individual component, but that's not a requirement; in fact, some also argue that having the two separate is better for accessibility, as screen readers can read the two features better if they are separate. (A box of links that connect you to the site's main sections and other parts)

 

③ main content: A big area in the center that contains most of the unique content of a given webpage, for example, the video you want to watch, or the main story you're reading, or the map you want to view, or the news headlines, etc. This is the one part of the website that definitely will vary from page to page! (The primary content of the page currently on screen)

 

④ sidebar: Some peripheral info, links, quotes, ads, etc. Usually, this is contextual to what is contained in the main content (for example on a news article page, the sidebar might contain the author's bio, or links to related articles) but there are also cases where you'll find some recurring elements like a secondary navigation system. (A pane with secondary content that is usually related to the main content)

 

⑤ footer: A strip across the bottom of the page that generally contains fine print, copyright notices, or contact info. It's a place to put common information (like the header) but usually, that information is not critical, or is secondary, to the website itself. The footer is also sometimes used for SEO purposes, by providing links for quick access to popular content. (Less important or reference material such as page permissions, copyright, and contact info, shown at the bottom)

 

- Example website (figure) -

(In this example the navigation bar sits inside the header. The main content is in the middle, the sidebar is on the right, and the footer is at the bottom.)
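
Knowing these regions helps when scraping, because you usually want the main content and not the header or footer. Below is a rough sketch that groups a page's text by region. It assumes the page marks each region with the HTML5 elements `<header>`, `<nav>`, `<main>`, `<aside>`, and `<footer>` (many real sites use plain `<div>`s instead); `PAGE` and `RegionText` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page that uses one HTML5 semantic element per region.
PAGE = """
<body>
  <header><h1>My Site</h1><nav><a href="/">Home</a></nav></header>
  <main><h2>Article title</h2><p>Article body.</p></main>
  <aside>Related links</aside>
  <footer>Copyright notice</footer>
</body>
"""


class RegionText(HTMLParser):
    """Groups text by the nearest enclosing layout region."""

    REGIONS = {"header", "nav", "main", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.stack = []                              # currently open region tags
        self.text = {name: [] for name in self.REGIONS}

    def handle_starttag(self, tag, attrs):
        if tag in self.REGIONS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        chunk = data.strip()
        if chunk and self.stack:
            self.text[self.stack[-1]].append(chunk)


regions = RegionText()
regions.feed(PAGE)
print(regions.text["main"])
```

Because the navigation bar is nested inside the header here, its link text is filed under `nav`, the innermost region, and not under `header`.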

 


2. DOM (Document Object Model)

👩‍🔬 The DOM (Document Object Model) is a programming interface for documents such as HTML and XML. It lets a programming language access the document. In other words, the DOM acts as a bridge between the document and 'a programming language I can write freely myself'!

 

→ A kind of API for HTML and XML documents

It has a logical structure that resembles a tree

→ Parsing a document with a DOM parser gives you a tree-shaped structure that contains every element of the document
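
As a rough illustration of "parse a document, get a tree", here is a sketch using `xml.dom.minidom` from Python's standard library. Note that minidom is an XML parser, so the sketch assumes well-formed markup; real-world HTML usually needs a more forgiving parser. `DOC` and `tree_lines` are names made up for this example.

```python
from xml.dom import minidom

# minidom is an XML parser, so this assumes well-formed (XHTML-style) markup.
DOC = "<html><body><h1>Title</h1><ul><li>one</li><li>two</li></ul></body></html>"


def tree_lines(node, depth=0):
    """Return one indented line per element or text node in the DOM tree."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(tree_lines(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    return lines


dom = minidom.parseString(DOC)              # the DOM parser builds the tree
lines = tree_lines(dom.documentElement)     # walk it from the root element
print("\n".join(lines))
```

Each level of indentation in the output is one level deeper in the tree: `html` is the root, `body` is its child, and so on down to the text inside each `li`.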

 

👩‍🔬 Advantages of the DOM?

* Used for manipulating document structures. (You can manipulate a document easily, using nothing but a programming language!)

* Data persists in memory. (The data stays in memory!)

* You can go forwards and backwards in the tree (random access). (You can freely reach any element in the tree hierarchy, in any order)

* You can make changes directly to the tree in memory. (You can modify the in-memory tree right away!)
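
Here is a small sketch of the last two points, random access and in-memory changes, again using `xml.dom.minidom` and assuming well-formed markup. The sample list is made up for illustration.

```python
from xml.dom import minidom

dom = minidom.parseString("<ul><li>one</li><li>two</li><li>three</li></ul>")
items = dom.getElementsByTagName("li")

# Random access: jump to any node, then move backward, forward, or up.
second = items[1]
previous_text = second.previousSibling.firstChild.data    # 'one'
next_text = second.nextSibling.firstChild.data            # 'three'
parent_name = second.parentNode.tagName                   # 'ul'

# Change the tree in memory: edit text, add a node, remove a node.
second.firstChild.data = "TWO"
new_item = dom.createElement("li")
new_item.appendChild(dom.createTextNode("four"))
dom.documentElement.appendChild(new_item)
dom.documentElement.removeChild(items[0])

updated = dom.documentElement.toxml()
print(updated)
```

Nothing is written back to a file here; every change happens on the tree held in memory, and `toxml()` only serializes its current state.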

 

👩‍🔬 DOM methods (only the representative ones, the most commonly used!)

 

DOM method | What it does
getElementsByTagName | returns the document elements with the given tag name
getElementById | returns the element whose id matches the given id
getElementsByClassName | returns all elements that belong to the given class
querySelector | returns the first element that matches the given selector
querySelectorAll | returns all elements that match the given selector
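
The methods in the table are browser (JavaScript) DOM methods, but the same idea can be tried from Python. In the sketch below, `getElementsByTagName` is the real method provided by `xml.dom.minidom`, while `get_element_by_id` and `get_elements_by_class_name` are hypothetical helpers I wrote to mimic the browser methods. CSS-selector lookups (`querySelector`, `querySelectorAll`) have no counterpart in minidom, so they are left out. The ids and class names are made up.

```python
from xml.dom import minidom

# Assumes well-formed markup; the ids and class names are made up.
DOC = """<div>
  <p id="intro" class="note">Hello</p>
  <p class="note highlight">World</p>
  <span class="highlight">!</span>
</div>"""

dom = minidom.parseString(DOC)


def all_elements(root):
    """Root plus every element below it, in document order."""
    return [root] + list(root.getElementsByTagName("*"))


def get_element_by_id(root, element_id):
    """Hypothetical helper mimicking getElementById: first element with that id."""
    for element in all_elements(root):
        if element.getAttribute("id") == element_id:
            return element
    return None


def get_elements_by_class_name(root, class_name):
    """Hypothetical helper mimicking getElementsByClassName."""
    return [element for element in all_elements(root)
            if class_name in element.getAttribute("class").split()]


def text_of(element):
    """Concatenate the text nodes directly inside an element."""
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)


paragraphs = dom.getElementsByTagName("p")               # built into minidom
intro = get_element_by_id(dom.documentElement, "intro")
notes = get_elements_by_class_name(dom.documentElement, "note")
highlights = get_elements_by_class_name(dom.documentElement, "highlight")

print(len(paragraphs), text_of(intro))
print([text_of(e) for e in notes], [text_of(e) for e in highlights])
```

An id lookup returns a single element, while tag-name and class lookups return a collection, which matches the singular/plural naming in the table.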

 

👩‍🔬 When you actually crawl + scrape a web page, treating the whole page as raw text and picking through it piece by piece is far too inefficient. Instead, we use DOM methods to pull out only the elements we want from the content represented as a DOM tree!

 

👩‍🔬 Practice> Open the page you want in Chrome, right-click → Inspect, then switch to the Console tab. There you can run DOM method code to fetch the elements you want!

 

- The DOM lays out a document's elements in a tree (figure) -

 


* ์ธ๋„ค์ผ ์ถœ์ฒ˜) https://www.dreamstime.com/web-data-scraping-color-icon-screen-scraping-web-data-extractor-robotic-process-automation-web-harvesting-automatic-cleaning-image178649664

* ์ถœ์ฒ˜1) <์ฐธ๊ณ ํ•œ ๊ฐ•์ขŒ> https://www.youtube.com/watch?v=i0FN-OwJ7QI&t=130s 

* ์ถœ์ฒ˜2) <html> https://en.wikipedia.org/wiki/HTML

* ์ถœ์ฒ˜3) mozilla ๊ณต์‹ docu https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Document_and_website_structure

* ์ถœ์ฒ˜4) web scraping & web crawling https://www.zyte.com/learn/difference-between-web-scraping-and-web-crawling/

* ์ถœ์ฒ˜5) DOM https://www.w3.org/TR/REC-DOM-Level-1/introduction.html

 
