 Post subject: DOM Scraping
Posted: Fri Jan 01, 2016 1:30 pm

So lately I have been working on a UserAgent headless "browser" (if u will) in BASIC!. While the full code is not quite rdy, I was recently faced with the challenge of creating a DOM tree to easily pick apart web pages. Now, bear with me on this tutorial... there may be easier ways to get this done.

Let's create a "parser.html". You can save it to /rfo-basic/data/
And here it is:
<script src=""></script>

$.fn.findSelector = function(selector) {
    var el = $(selector).val();

$(document).ready(function() {

function getDOM(selector) {

Now you may notice a little confusion, and that's ok.
We will be calling a JS function called getDOM() from BASIC!.
It in turn calls a jQuery function which finds what we need.

Notice that the .ready() event will call back to BASIC!
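
The listing above seems to have lost its function bodies, so here is a minimal sketch of what the description implies. It is not the original code. It assumes the page runs inside BASIC!'s HTML view, where Android.dataLink(text) hands text to HTML.get.datalink with a "DAT:" prefix. The names makeParser, find and send are hypothetical, added only so the logic can be tried outside the browser too.

```javascript
// Sketch only: makeParser, find and send are hypothetical helper names.
// find(selector) returns the scraped value; send(text) delivers text to BASIC!.
function makeParser(find, send) {
  return {
    // Tell BASIC! the page is loaded; BASIC! sees this as "DAT:rdy_Cid".
    ready: function () {
      send("rdy_Cid");
    },
    // Look up the selector and send the result (or an empty string) back.
    getDOM: function (selector) {
      var value = find(selector);
      send(value === undefined || value === null ? "" : String(value));
    }
  };
}

// Wiring inside parser.html, assuming jQuery is loaded and the page is shown
// in BASIC!'s HTML view (skipped automatically anywhere else, e.g. Node).
if (typeof $ !== "undefined" && typeof Android !== "undefined") {
  var parser = makeParser(
    function (sel) { return $(sel).val(); },
    function (msg) { Android.dataLink(msg); }
  );
  // BASIC! reaches this one through the "javascript:getDOM(...)" URL.
  window.getDOM = function (sel) { parser.getDOM(sel); };
  $(document).ready(parser.ready);
}
```

Keeping the lookup and the messaging as two separate functions is just a convenience: you can swap .val() for .text() or .attr() without touching the part that talks to BASIC!.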

OK, let's look at the BASIC! code now.
We will create a function that takes an element selector and some html:
fn.def DOM$(selector$, html$)
Rem let's load the parser
Text.open r, FT, "parser.html"
do
   Text.readln FT, line$
   if line$ <> "EOF" THEN
      parser$ += line$ + chr$(10)
   endif
until line$ = "EOF"
text.close FT

Rem now let's take our html and inject the parser code into it. There are several ways to do this, however, we will just replace the <html> occurrence, there should only be one
Rem REPLACE$ takes the source string first, then the text to find, then the replacement
html$ = REPLACE$(html$, "<html>", "<html>"+parser$)

Rem now let's save our new html with the parser injected
Text.open w, FT, "HTML.html"
Text.writeln FT, html$
text.close FT

Rem and now to load 'er up (the HTML view has to be open before anything can be loaded)
HTML.open
HTML.load.URL "HTML.html"

print "Selector: " + selector$

Rem the page sends this to BASIC! to let it know it is rdy for scraping
do
   HTML.get.datalink data$
   if data$ = "DAT:rdy_Cid" then
      print "RDY!"
      inj$ = "javascript:getDOM(\""+selector$+"\");"
      Rem call getDOM() when rdy
      HTML.load.URL inj$
   endif
UNTIL data$ <> "" & data$ <> "DAT:rdy_Cid"
HTML.close

Rem mid$(data$, 5) drops the "DAT:" prefix
fn.rtn "RTN: " + mid$(data$, 5)
fn.end

Print DOM$("input", html$)

We are done. But how can we do a little more?

Well, recently we learned that we can put null characters into a string and then split on them to get an array.
Look at the jQuery code one more time.
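var el = $(selector);

// $(selector) always returns a jQuery object, so test for a match instead
if (el.length > 0) {
  var s = el.val() + ascii("NULL");
  s += el.text() + ascii("NULL");
  s += el.attr("name") + ascii("NULL");
}

// "NULL" stands for the null character, chr$(0) on the BASIC! side
function ascii(a) {
  return a === "NULL" ? String.fromCharCode(0) : a;
}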

And ofc we can do this all day no problems. Adding as many things as we would like and formatting it however we want.
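
To make the separator idea a bit more concrete, here is a small sketch of packing any number of fields into one string. packFields and unpackFields are hypothetical names, and the unpack half is only there to check the round trip (on the BASIC! side, split does that job).

```javascript
// Hypothetical helpers illustrating the null-separator idea.
var SEP = String.fromCharCode(0); // the same character as chr$(0) in BASIC!

// Join values into one string. Missing values become empty fields so the
// positions (1 = val, 2 = text, 3 = name) stay stable.
function packFields(values) {
  return values.map(function (v) {
    return v === undefined || v === null ? "" : String(v);
  }).join(SEP);
}

// Reverse of packFields, shown only to check the round trip.
function unpackFields(packed) {
  return packed.split(SEP);
}

// Example: the third field is missing, as when an element has no name attribute.
var packed = packFields(["42", "Submit", undefined]);
console.log(unpackFields(packed));
```

The empty-string step matters because .attr("name") returns undefined when the attribute is missing, and plain string concatenation would then put the word "undefined" into the field. This also assumes no field value contains a null character itself.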

And your BASIC! side would look like this:

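S$ = DOM$("input", html$)
Gosub makepretty
Print el$[name]
Print el$[val]
Print ":)"

Goto skipmakepretty
makepretty:
Rem split takes the result array first, then the string to split
split el$[], s$, chr$(0)
Let val = 1
Let text = 2
Let name = 3
Return
skipmakepretty: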

I want to say sorry ahead of time for any possible typos. I spent like 1hr typing all of this up fresh because I don't have a computer right now.

This is just to show you the general idea. We could easily use jQuery to dump the entire DOM tree and all attributes of every element into a multidimensional array or list.
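
A rough sketch of what such a dump could look like, in plain JavaScript rather than jQuery. dumpTree is a hypothetical name, and the [tag, attributes, children] layout is only one possible choice. It reads tagName, attributes and children, so it should work on real DOM elements and also on plain objects shaped the same way, which is what the example uses.

```javascript
// Hypothetical helper: turn an element into [tagName, attributes, children].
// Works on real DOM elements and on plain objects with the same three fields.
function dumpTree(node) {
  var attrs = {};
  // attributes is array-like: each entry has a name and a value
  Array.prototype.forEach.call(node.attributes || [], function (a) {
    attrs[a.name] = a.value;
  });
  // recurse into child elements only (text nodes are not in children)
  var kids = Array.prototype.map.call(node.children || [], function (child) {
    return dumpTree(child);
  });
  return [String(node.tagName).toLowerCase(), attrs, kids];
}

// Example with plain objects standing in for a tiny page.
var sample = {
  tagName: "FORM",
  attributes: [{ name: "id", value: "login" }],
  children: [
    { tagName: "INPUT", attributes: [{ name: "name", value: "user" }], children: [] }
  ]
};
console.log(JSON.stringify(dumpTree(sample)));
```

Inside the parser page you could pass document.documentElement and send the JSON string back over the data link, then pick it apart on the BASIC! side however you like.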

Good luck and happy coding
