Data Engineering & Web Scraping

A comparative implementation of the entire data acquisition stack: from Bash scripting to raw sockets and high-level APIs.

Bash Scripting · TCP Sockets · Python Requests · HTML Parsing

Project Overview

In data engineering, understanding the layers of abstraction is crucial for optimization and debugging. This project demonstrates the full stack of data acquisition by implementing the same web scraper at three distinct layers.

By extracting a specific data point ("Latest Update Date") from a university endpoint, I benchmarked the complexity and control offered by shell scripting (fastest prototyping), raw TCP sockets (maximum control), and high-level APIs (the production standard).

Key Concepts Implemented

Shell Scripting (Layer 1)

Used `wget`, `grep`, and `sed` to pipe HTTP streams directly into text processors, demonstrating the Unix philosophy of rapid data extraction.

Raw Sockets (Layer 2)

Manually constructed HTTP GET requests and parsed raw byte streams over TCP, handling connection setup and buffering without external libraries.

High-Level APIs (Layer 3)

Utilized the `requests` library to abstract away connection pooling and encoding handling, representing the industry standard for maintainable scrapers.

Source Code Comparison

1. Shell Script (`p4_hw6.sh`)

A concise pipeline, using sed regular expressions for extraction.

#!/bin/bash

URL="https://web.physics.ucsb.edu/~phys129/lipman/"

echo "URL: $URL"

wget -q -O - "$URL" | grep -i "Latest update" | sed -e 's/.*">//' -e 's/<.*//' -e 's/&nbsp;/ /g'
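The text-processing half of the pipeline can be exercised offline by substituting a here-string for the `wget` fetch. A minimal sketch, using a made-up HTML line in place of the live page (the sed patterns here are generalized to strip any tag, not the exact patterns from the script above):

```shell
html='<p>Latest update:&nbsp;<b>Jan 1, 2024</b></p>'
echo "$html" \
  | grep -i "Latest update" \
  | sed -e 's/.*Latest update://' \
        -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g'
```

This isolates the extraction logic for testing without hitting the network on every run.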

2. Raw TCP Sockets (`p5_hw6.py`)

Manual HTTP implementation showing full control over the byte stream.

#!/usr/bin/env python3
# Adapted heavily from the course's client.py example
import sys
import os
import socket

def open_connection(ipn, prt):
   s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
   connect_error = s.connect_ex((ipn, prt))
   if connect_error:
      if connect_error == 111:
         print('Connection refused.  Check address and try again.', file=sys.stderr)
         sys.exit(1)
      else:
         print('Error %d connecting to %s:%d' % (connect_error,ipn,prt), file=sys.stderr)
         sys.exit(1)
   return(s)

def receive_data(thesock, nbytes):
   dstring = b''
   rcount = 0 
   thesock.settimeout(5)
   while rcount < nbytes:
      try:
         somebytes = thesock.recv(min(nbytes - rcount, 8192))
      except socket.timeout:
         print('Connection timed out.', file = sys.stderr)
         break
      if somebytes == b'':
         print('Connection closed', file = sys.stderr)
         break
      rcount = rcount + len(somebytes)
      dstring = dstring + somebytes
      
   print('\n %d bytes received. \n' % rcount)
   return(dstring)

def find_announcement_date(html_text):
   """Search for the 'Latest update' announcement line in the HTML,
   strip its markup, and print the result."""
   lines = html_text.split('\n')
   for line in lines:
      if 'Latest update:' in line:
         start = line.find('Latest update:')
         end = line.find('</', start)   # end of the enclosing element's text
         if end != -1:
            line = line[start:end]
         else:
            line = line[start:]
         cleanline = line.strip()
         # Remove any HTML tags remaining inside the extracted span
         while '<' in cleanline and '>' in cleanline:
            start_tag = cleanline.find('<')
            end_tag = cleanline.find('>', start_tag)
            if end_tag != -1:
               cleanline = cleanline[:start_tag] + cleanline[end_tag+1:]
            else:
               break
         cleanline = cleanline.replace('&nbsp;', ' ')
         print(cleanline.strip())
         return

if __name__ == '__main__':
   ipnum = 'web.physics.ucsb.edu'
   port = 80
   thesocket = open_connection(ipnum, port)
   http_request = b'GET /~phys129/lipman/ HTTP/1.0\r\n\r\n'
   thesocket.sendall(http_request)
   indata = receive_data(thesocket, 8192)
   thesocket.shutdown(socket.SHUT_RDWR)
   thesocket.close()
   datastring = indata.decode('utf-8', errors='ignore')
   find_announcement_date(datastring)
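The `receive_data` loop above matters because TCP delivers a byte stream, not messages: a single `recv()` call may return fewer bytes than requested. The same pattern can be demonstrated without any network access using `socket.socketpair()`; this is a standalone sketch, and the helper name `recv_exact` is mine, not from the project:

```python
import socket

def recv_exact(sock, nbytes, chunk=4096):
    """Collect nbytes from sock, looping because a single recv()
    may return fewer bytes than requested."""
    data = b''
    while len(data) < nbytes:
        part = sock.recv(min(nbytes - len(data), chunk))
        if not part:  # empty bytes: the peer closed the connection
            break
        data += part
    return data

# A local socket pair stands in for the TCP connection to the web server
left, right = socket.socketpair()
left.sendall(b'x' * 10000)
left.close()
payload = recv_exact(right, 10000)
right.close()
print(len(payload))  # 10000
```

Forcing a small `chunk` makes the loop iterate several times, which is exactly the case the one-shot `recv()` call would silently truncate.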

3. Requests Library (`p6_hw6.py`)

Production-grade implementation using high-level abstractions.

#!/usr/bin/env python3

import requests

response = requests.get('http://web.physics.ucsb.edu/~phys129/lipman/', timeout=10)

for line in response.text.split('\n'):
    if 'Latest update:' in line:
        start = line.find('Latest update:')
        end = line.find('</', start)  # end of the enclosing element's text
        if end != -1:
            line = line[start:end]
        # Strip any HTML tags left inside the extracted span
        while '<' in line and '>' in line:
            line = line[:line.find('<')] + line[line.find('>')+1:]
        print(line.replace('&nbsp;', ' ').strip())
        break
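Manual string slicing works for a single known line, but a real HTML parser is more robust against markup changes. A minimal sketch using only the standard library's `html.parser` (the class name `TextExtractor` and the sample HTML are mine, standing in for the live page):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring tags.
    convert_charrefs (on by default) turns &nbsp; into '\xa0'."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)

# Made-up HTML standing in for the fetched page body
sample = '<p><b>Latest update:</b>&nbsp;Jan 1, 2024</p>'
parser = TextExtractor()
parser.feed(sample)
print(parser.text().replace('\xa0', ' ').strip())  # Latest update: Jan 1, 2024
```

Unlike the find/slice approach, this handles nested tags and character references without any per-page regex tuning.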