[2] ACS Journal Crawler v1.1

Author: Pharm_D

ACS Journal Crawler

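ACS journals are published by the American Chemical Society (ACS) and carry prominent, up-to-date research in chemistry. ACS comprises many individual journals such as JACS, JOC, and OL, and each one lists its newest papers on an ASAP (As Soon As Publishable) page.

A journal that mainly publishes letters, such as Organic Letters, uploads several new papers per day (on business days). Keeping up with current research therefore meant checking the ASAP page of every journal one by one, which was tedious.

So I wrote the Python script below, a web crawler that saves the information for the journals you choose as an Excel file.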
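The script reads its list of journals from a file named `journal_url.xlsx`, from a sheet called `url` with one column for the abbreviation and one for the link. As a rough sketch of how such an input file could be prepared: the helper names `build_journal_table` and `write_journal_file` are hypothetical, and the ASAP URLs are assumed example values that should be checked against each journal's page.

```python
import pandas as pd

# Example rows only: the abbreviations come from the post, but the ASAP URLs
# are assumptions and should be verified before use.
JOURNALS = [
    ("JACS", "https://pubs.acs.org/toc/jacsat/0/0"),
    ("JOC", "https://pubs.acs.org/toc/joceah/0/0"),
    ("OL", "https://pubs.acs.org/toc/orlef7/0/0"),
]


def build_journal_table(journals):
    """Return a two-column table (Abb, Link) in the layout the crawler reads."""
    return pd.DataFrame(journals, columns=["Abb", "Link"])


def write_journal_file(path, journals):
    """Write the table to the 'url' sheet, with a header row and no index."""
    build_journal_table(journals).to_excel(path, sheet_name="url", index=False)


table = build_journal_table(JOURNALS)
print(table)
# To create the file itself (needs openpyxl or xlsxwriter installed):
# write_journal_file("journal_url.xlsx", JOURNALS)
```

The header row matters because the script passes `names=['Abb', 'Link']` to `pd.read_excel`, which replaces the first row's labels rather than treating that row as data.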

Code

from io import BytesIO

import pandas as pd
import requests
import xlsxwriter
from bs4 import BeautifulSoup
from selenium import webdriver

# 2021-10-07, v1.1
# update TOC in excel
#


### To-be : merge to one page ###

# num is the number of papers collected from each journal's ASAP page
num = int(input("Please enter the number of papers per journal : "))
workbook = xlsxwriter.Workbook('summary.xlsx')
ws = workbook.add_worksheet('TOC')

# TOC setting
ws.set_column('A:B', 4)
ws.set_column('C:C', 32)
ws.set_column('D:D', 100)
ws.set_column('E:E', 10)
ws.set_default_row(80)
ws.set_row(0, 25)
ws.write(0, 0, 'Num')
ws.write(0, 1, 'Abb')
ws.write(0, 2, 'TOC')
ws.write(0, 3, 'Title')
ws.write(0, 4, 'Date')
ws.write(0, 5, 'DOI')


# Crawling information from journal_url file
journal_url = pd.read_excel(
    './journal_url.xlsx', sheet_name='url', names=['Abb', 'Link'])
journal_urls = journal_url['Link'].tolist()
Abb = journal_url['Abb'].tolist()

row = 1  # TOC row
for abbreviation, link in zip(Abb, journal_urls):
    req = requests.get(link)
    soup = BeautifulSoup(req.content, 'html.parser')
    tocs = soup.select('div.issue-item_img > img')  # all TOC image tags on the page

    for j in range(num):
        toc = tocs[j]  # Load the TOC
        # Slice the image path out of the tag's string form; the fixed
        # offsets depend on the exact layout of the <img> tag.
        image_url = 'https://pubs.acs.org' + str(toc)[35:-3]
        print(image_url)
        res = requests.get(image_url)
        image_data = BytesIO(res.content)  # Process the image file
        # Embed the downloaded bytes; image_url only serves as the image name.
        ws.insert_image('C%d' % (row+1), image_url,
                        {'x_scale': 0.45, 'y_scale': 0.45, 'image_data': image_data})
        ws.write(row, 0, row)
        ws.write(row, 1, abbreviation)
        row += 1
print("---------------------------------------------")
print("----1/3 Save the TOC from ACS publication----")
print("---------------------------------------------")


workbook.close()

# crawling information from journal_url file
journal_url = pd.read_excel(
    './journal_url.xlsx', sheet_name='url', names=['Abb', 'Link'])
journal_urls = journal_url['Link'].tolist()
Abb = journal_url['Abb'].tolist()

# Empty DataFrame and accumulator list
df = pd.DataFrame(columns=['Abb', 'No.', 'Title', 'Date', 'DOI'])
summary = []
count = 1  # Running paper number across all journals

# ACS journal option
for abbreviation, link in zip(Abb, journal_urls):
    options = webdriver.ChromeOptions()
    options.add_argument("headless")  # run Chrome without a visible window
    # Assumes chromedriver can be found on PATH or via Selenium Manager;
    # newer Selenium 4 releases no longer accept the driver path positionally.
    driver = webdriver.Chrome(options=options)
    driver.get(link)
    req = requests.get(link)
    soup = BeautifulSoup(req.content, 'html.parser')

    # The find_elements_by_* helpers were removed in Selenium 4.3, so the
    # locator strategy is passed as a string instead.
    titles = driver.find_elements('class name', 'issue-item_title')
    dates = driver.find_elements('class name', 'pub-date-value')
    links = soup.select('div.issue-item_metadata > span > h5 > a')

    for j in range(num):  # Crawling the ACS ASAP journal
        doi = links[j]['href']  # Load the doi
        summary.append([abbreviation, count, titles[j].text,
                        dates[j].text, "https://pubs.acs.org/doi"+doi[4:]])  # can modify sci-hub
        count += 1
    driver.quit()  # end the browser session for this journal
# DataFrame.append was removed in pandas 2.0, so build the frame directly
df = pd.DataFrame(summary, columns=['Abb', 'No.', 'Title', 'Date', 'DOI'])

print(df)
print("---------------------------------------------")
print("---2/3 Save the Bibliographic Information----")
print("---------------------------------------------")
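# Append the sheet to the workbook created above; the context manager saves
# the file on exit (writer.save() was removed in pandas 2.0).
with pd.ExcelWriter('summary.xlsx', mode='a', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Information', index=False)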
print("---------------------------------------------")
print("-----------------3/3 Finish------------------")
print("---------------------------------------------")
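The script builds both the TOC image URL and the DOI link by slicing strings at fixed offsets. One possible alternative, shown here only as a sketch, is to resolve a site-relative path against the host with the standard library's `urljoin`. The helper name `absolute_url` and the sample paths are hypothetical; only the host `https://pubs.acs.org` comes from the script.

```python
from urllib.parse import urljoin

ACS_BASE = "https://pubs.acs.org"  # same host the script hard-codes


def absolute_url(path, base=ACS_BASE):
    """Resolve a site-relative path such as '/doi/...' against the base host."""
    return urljoin(base, path)


# Hypothetical href, shaped like the '/doi/...' links the script slices.
example_href = "/doi/10.1021/example.0000001"
print(absolute_url(example_href))

# An already absolute URL passes through unchanged.
print(absolute_url("https://example.org/toc.gif"))
```

For a `/doi/...` href this gives the same result as `"https://pubs.acs.org/doi" + doi[4:]`, without depending on the length of the prefix.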

When I first wrote this script two years ago I used Selenium and bs4 together, but if I revised it now I think it could be written more simply. I plan to update it when I have time.
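One way such a simplification might look is to read titles, dates, and DOI links from a single BeautifulSoup parse, with no browser involved. This is only a sketch: it assumes the titles and dates are present in the statically served HTML, which the post does not confirm. The function `parse_items` and the `SAMPLE_HTML` fragment are hypothetical; the fragment merely mimics the class names the script selects on, and the real ACS markup may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

ACS_BASE = "https://pubs.acs.org"

# Hypothetical fragment for illustration; not copied from an ACS page.
SAMPLE_HTML = """
<div class="issue-item">
  <div class="issue-item_metadata">
    <span><h5 class="issue-item_title"><a href="/doi/10.1021/example.0000001">First example title</a></h5></span>
    <span class="pub-date-value">October 7, 2021</span>
  </div>
</div>
<div class="issue-item">
  <div class="issue-item_metadata">
    <span><h5 class="issue-item_title"><a href="/doi/10.1021/example.0000002">Second example title</a></h5></span>
    <span class="pub-date-value">October 6, 2021</span>
  </div>
</div>
"""


def parse_items(html, abbreviation, limit):
    """Collect up to `limit` papers from one page using BeautifulSoup only."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.select(".issue-item_title")
    dates = soup.select(".pub-date-value")
    links = soup.select("div.issue-item_metadata > span > h5 > a")
    rows = []
    # zip stops at the shortest list, so a short page cannot raise IndexError.
    for title, date, link in list(zip(titles, dates, links))[:limit]:
        rows.append({
            "Abb": abbreviation,
            "Title": title.get_text(strip=True),
            "Date": date.get_text(strip=True),
            "DOI": urljoin(ACS_BASE, link["href"]),
        })
    return rows


rows = parse_items(SAMPLE_HTML, "OL", limit=1)
print(rows)
```

In a real run the `html` argument would be `requests.get(link).content`, and the resulting list of dictionaries can be passed straight to `pd.DataFrame`.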

The original post and code are also available at the link below.

GitHub

GitHub - ssrihappy/JournalASAP: Crawling the ASAP journals


License: Attribution-NonCommercial-NoDerivatives
