Data Engineering & Web Scraping

A comparative implementation of the entire data acquisition stack: from Bash scripting to Raw Sockets and High-Level APIs.

`Bash Scripting` · `TCP Sockets` · `Python Requests` · `HTML Parsing`

Project Overview

In data engineering, understanding the layers of abstraction is crucial for optimization and debugging. This project demonstrates the full stack of data acquisition by implementing the exact same web scraper in three distinct layers.

By extracting a single data point (the "Latest update" date) from a university web page, I benchmarked the complexity and control offered by Shell Scripting (fastest prototyping), Raw TCP Sockets (maximum control), and High-Level APIs (production standard).
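To make the comparison concrete, the three implementations can be timed end-to-end with a small harness. The sketch below is illustrative only, assuming the three scripts from the Source Code Comparison section sit in the current directory; it measures wall-clock run time, which is dominated by network latency rather than by the layer of abstraction.

```python
#!/usr/bin/env python3
# Illustrative timing harness (assumes p4_hw6.sh, p5_hw6.py, and
# p6_hw6.py are present in the current directory).
import subprocess
import time

SCRIPTS = [
    ['bash', 'p4_hw6.sh'],      # Layer 1: shell pipeline
    ['python3', 'p5_hw6.py'],   # Layer 2: raw TCP sockets
    ['python3', 'p6_hw6.py'],   # Layer 3: requests library
]

for cmd in SCRIPTS:
    t0 = time.perf_counter()
    subprocess.run(cmd, capture_output=True)
    elapsed = time.perf_counter() - t0
    print('%-22s %6.2f s' % (' '.join(cmd), elapsed))
```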

Key Concepts Implemented

Shell Scripting (Layer 1)

Used `wget`, `grep`, and `sed` to pipe HTTP streams directly into text processors, demonstrating Unix philosophy for rapid data extraction.

Raw Sockets (Layer 2)

Manually constructed HTTP GET requests and parsed raw byte streams over TCP, handling connection handshakes and buffering without external libraries.

High-Level APIs (Layer 3)

Utilized the `requests` library to abstract connection pooling and encoding handling, representing the industry standard for maintainable scrapers.
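The script in the next section uses a one-shot `requests.get`; for repeated fetches, a `requests.Session` makes the connection pooling explicit. A minimal sketch (the timeout value is an arbitrary choice, not part of the original scripts):

```python
import requests

# A Session reuses TCP connections across requests (connection pooling)
# and carries headers and cookies between calls.
with requests.Session() as session:
    response = session.get(
        'https://web.physics.ucsb.edu/~phys129/lipman/',
        timeout=10,  # arbitrary; prevents hanging on an unresponsive server
    )
    response.raise_for_status()   # fail loudly on HTTP error codes
    print(response.encoding)      # character encoding detected for us
```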

Source Code Comparison

1. Shell Script (`p4_hw6.sh`)

A concise pipeline, effectively using regex substitution for extraction.

```bash
#!/bin/bash

# Target page to scrape
URL="https://web.physics.ucsb.edu/~phys129/lipman/"

echo "URL: $URL"

# Fetch the page, isolate the "Latest update" line, strip the surrounding
# HTML markup, and convert &nbsp; entities to plain spaces
wget -q -O - "$URL" | grep -i "Latest update" | sed -e 's/.*">//' -e 's/<.*//' -e 's/&nbsp;/ /g'
```
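For example, a page line of the shape `<a href="update.html">Latest update:&nbsp;Friday</a>` (hypothetical markup, not the live page's exact HTML) is reduced by the three `sed` expressions in turn: `s/.*">//` drops everything through the `">` that closes the anchor's opening tag, `s/<.*//` cuts from the next `<` onward, and `s/&nbsp;/ /g` converts the entity to a space, leaving `Latest update: Friday`.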

2. Raw TCP Sockets (`p5_hw6.py`)

Manual HTTP implementation showing full control over the byte stream.

```python
#!/usr/bin/env python3
# Much of this code is adapted from the client.py example file
import sys
import socket

def open_connection(ipn, prt):
    """Open a TCP connection to (ipn, prt), exiting on failure."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    connect_error = s.connect_ex((ipn, prt))
    if connect_error:
        if connect_error == 111:  # ECONNREFUSED on Linux
            print('Connection refused.  Check address and try again.', file=sys.stderr)
            sys.exit(1)
        else:
            print('Error %d connecting to %s:%d' % (connect_error, ipn, prt), file=sys.stderr)
            sys.exit(1)
    return s

def receive_data(thesock, nbytes):
    """Read up to nbytes from the socket, looping until done or closed."""
    dstring = b''
    rcount = 0
    thesock.settimeout(5)
    while rcount < nbytes:
        try:
            somebytes = thesock.recv(min(nbytes - rcount, 8192))
        except socket.timeout:
            print('Connection timed out.', file=sys.stderr)
            break
        if somebytes == b'':
            print('Connection closed', file=sys.stderr)
            break
        rcount = rcount + len(somebytes)
        dstring = dstring + somebytes

    print('\n %d bytes received. \n' % rcount, file=sys.stderr)
    return dstring

def find_announcement_date(html_text):
    """Search for the "Latest update" announcement in the HTML and print it."""
    lines = html_text.split('\n')
    for line in lines:
        if 'latest update' in line.lower():
            if 'Latest update:' in line:
                start = line.find('Latest update:')
                # Trim at the markup that follows the date, if present
                end = line.find('</a>', start)
                if end == -1:
                    end = line.find('<br>', start)
                if end != -1:
                    line = line[start:end]
                else:
                    line = line[start:]
                cleanline = line.strip()
                # Strip any remaining HTML tags
                while '<' in cleanline and '>' in cleanline:
                    start_tag = cleanline.find('<')
                    end_tag = cleanline.find('>', start_tag)
                    if end_tag != -1:
                        cleanline = cleanline[:start_tag] + cleanline[end_tag+1:]
                    else:
                        break
                # Convert non-breaking-space entities to plain spaces
                cleanline = cleanline.replace('&nbsp;', ' ')
                print(cleanline.strip())
                return

if __name__ == '__main__':
    ipnum = 'web.physics.ucsb.edu'
    port = 80
    thesocket = open_connection(ipnum, port)
    # HTTP/1.0: the server closes the connection after responding;
    # the Host header is included since most servers now require it
    http_request = b'GET /~phys129/lipman/ HTTP/1.0\r\nHost: web.physics.ucsb.edu\r\n\r\n'
    thesocket.sendall(http_request)
    indata = receive_data(thesocket, 8192)
    thesocket.shutdown(socket.SHUT_RDWR)
    thesocket.close()
    datastring = indata.decode('utf-8', errors='ignore')
    find_announcement_date(datastring)
```
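One caveat: this version speaks plain HTTP on port 80, while the shell script fetches the same page over HTTPS. Carrying the raw-socket approach over to HTTPS means wrapping the TCP socket in TLS before sending the request. A minimal sketch using the standard library's `ssl` module (same host and path as above; not part of the original assignment code):

```python
import socket
import ssl

host = 'web.physics.ucsb.edu'
context = ssl.create_default_context()  # verifies certificates by default

with socket.create_connection((host, 443)) as raw_sock:
    # Wrap the TCP socket in TLS; SNI requires the server hostname
    with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
        tls_sock.sendall(b'GET /~phys129/lipman/ HTTP/1.0\r\n'
                         b'Host: web.physics.ucsb.edu\r\n\r\n')
        chunks = []
        while True:
            data = tls_sock.recv(8192)
            if not data:                # server closed the connection
                break
            chunks.append(data)

print('%d bytes received over TLS' % len(b''.join(chunks)))
```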

3. Requests Library (`p6_hw6.py`)

Production-grade implementation using high-level abstractions.

```python
#!/usr/bin/env python3

import requests

response = requests.get('http://web.physics.ucsb.edu/~phys129/lipman/')

for line in response.text.split('\n'):
    if 'Latest update:' in line:
        start = line.find('Latest update:')
        # Trim at the first tag that follows the date, if present
        end = line.find('<', start)
        if end != -1:
            line = line[start:end]
        else:
            line = line[start:]
        # Strip any remaining HTML tags, then convert &nbsp; entities
        while '<' in line and '>' in line:
            line = line[:line.find('<')] + line[line.find('>')+1:]
        print(line.replace('&nbsp;', ' ').strip())
        break
```
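The string slicing above works for this one page but is brittle against markup changes. A more robust variant can lean on the standard library's `html.parser` to strip tags; the sketch below is an assumption about how one might harden the scraper, not part of the original submission:

```python
#!/usr/bin/env python3
from html.parser import HTMLParser

import requests

class TextExtractor(HTMLParser):
    """Collect the page's text content, ignoring all markup."""
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

response = requests.get('http://web.physics.ucsb.edu/~phys129/lipman/')
parser = TextExtractor()
parser.feed(response.text)

# Tags are gone; entities like &nbsp; were converted by the parser
for line in ''.join(parser.pieces).splitlines():
    if 'Latest update:' in line:
        print(' '.join(line.split()))   # also normalizes whitespace
        break
```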