notes.org

Example of a Naive Way of Parsing

import urllib

s = urllib.urlopen("https://github.com/siddhant3s/sendsms").read()
l = s.find("""<div id="repository_description" rel="repository_description_edit">""")
s = s[l:]
l = s.find("<p>")
s = s[l:]
s = s[3:]  # remove the leading "<p>"
r = s.find("<span")
s = s[:r]  # keep only the text before "<span"
print s
'A python script to send sms non-interactively via fullonsms.com'
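
The same find-and-slice idea can be sketched as a small helper. This is only an illustration in Python 3: the name extract_between and the SAMPLE string are hypothetical stand-ins for the downloaded page, not part of the original notes.

```python
def extract_between(html, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # skip past the opening marker
    end = html.find(end_marker, start)  # search only after the opening marker
    if end == -1:
        return None
    return html[start:end].strip()


# Hypothetical snippet standing in for the downloaded page.
SAMPLE = (
    '<div id="repository_description" rel="repository_description_edit">'
    "<p>A python script to send sms <span>edit</span></p></div>"
)

description = extract_between(SAMPLE, "<p>", "<span")
print(description)
```

It breaks as soon as the page markup changes, which is the reason for moving to a real parser.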

Results of University Website

http://uptu.ac.in/results/EVEN_SEMESTER_10_11/bte4_10_11.asp?rollno=0909110103
soup.find(text="First Year").next.next.string

BeautifulSoup

Basic Traversal and Finding

soup.html
soup.body
soup.p.parent
soup('p')
soup.find('p')
soup.findAll('p')
soup.findAll('div')
len(soup.findAll('div'))
soup.findAll('div', id="wrapper")

soup.findAll('div', onclick="window.location.reload()")  # festember
soup.findAll('div', onclick="window.location.reload()")[0].string
soup.findAll('div', onclick="window.location.reload()")[0]['class']

soup.findAll('div', id="wrapper")[0]['id']
soup.findAll('div', id="mainWrap")[0].header
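
To try these lookups without a live page, here is a minimal sketch on an inline document. It assumes Python 3 and the bs4 package (the successor to the BeautifulSoup 3 module used in these notes), where find_all is the newer spelling of findAll. The HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed

# Hypothetical page with the two div ids used in the notes.
HTML = """
<html><body>
<div id="wrapper"><p>First</p><p>Second</p></div>
<div id="mainWrap" class="main"><header>Title</header></div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

parent_name = soup.p.parent.name           # tag that contains the first <p>
all_p = soup.find_all("p")                 # every <p> in the document
div_count = len(soup.find_all("div"))
wrapper_id = soup.find_all("div", id="wrapper")[0]["id"]
main_div = soup.find_all("div", id="mainWrap")[0]
header_text = main_div.header.string       # text of the nested <header>
main_classes = main_div["class"]           # bs4 returns class as a list

print(parent_name, div_count, wrapper_id, header_text, main_classes)
```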

Bad HTML

from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
soup = BeautifulSoup(html)
print soup.prettify()

Unicode

Parsing

.parent
.contents

for x in soup.body:
    print x
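
A small sketch of the parent and children navigation, assuming Python 3 and the bs4 package rather than the BeautifulSoup 3 module; the one-line document is invented for the example.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed

soup = BeautifulSoup(
    "<html><body><h1>Title</h1><p>Text</p></body></html>", "html.parser"
)

body = soup.body
child_names = [child.name for child in body]  # iterating a tag walks its direct children
contents_len = len(body.contents)             # .contents is the same children as a list
parent_name = soup.h1.parent.name             # .parent goes one level up

print(child_names, contents_len, parent_name)
```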

Searching

findAll

  • a regex
  • attrs
  • a list of tag names to find, e.g. ['table', 'p']
  • a function
  • keyword arguments to findAll
  • CSS class shortcut
  • calling a tag (or the soup) is the same as calling findAll on it
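
Each of the argument forms listed above can be tried on a small document. This is a sketch assuming Python 3 and the bs4 package, where findAll is spelled find_all and the class search is written class_; the HTML and variable names are made up for illustration.

```python
import re

from bs4 import BeautifulSoup  # assumption: bs4 is installed

HTML = """
<table id="scores"><tr><td>1</td></tr></table>
<p class="note">First</p>
<p id="intro">Second</p>
<b data-role="tag">Bold</b>
"""
soup = BeautifulSoup(HTML, "html.parser")

by_regex = soup.find_all(re.compile("^t"))              # tag names starting with "t"
by_attrs = soup.find_all(attrs={"data-role": "tag"})    # attribute dictionary
by_list = soup.find_all(["table", "p"])                 # any name in the list
by_func = soup.find_all(lambda tag: tag.has_attr("id")) # function called on each tag
by_keyword = soup.find_all("p", id="intro")             # keyword argument = attribute
by_class = soup.find_all("p", class_="note")            # search by CSS class
by_call = soup("p")                                     # same as soup.find_all("p")

print([t.name for t in by_regex])
```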

find

Returns only the first match.
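
A short sketch of the difference, assuming Python 3 and the bs4 package (find_all is the bs4 spelling of findAll); the two-paragraph document is invented.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first = soup.find("p")        # a single tag: the first match
every = soup.find_all("p")    # a list of all matches
missing = soup.find("table")  # None when nothing matches

print(first.string, len(every), missing)
```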

YouTube Example

Simple Cookie Example

Screen Setup

VNC Server