Thursday, July 23, 2009

Extracting data from web pages with JavaScript

Here is an example of how to extract data from Wikipedia by executing custom JavaScript on a page.

Create a script named customscript.js.

// This script converts a table on a Wikipedia page to plain text
// and shows the text in a new window. Works in Firefox.
var cellDelimiter = "|";
var rowDelimiter = "\n";
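var table = document.getElementById('sortable_table_id_0');
if (!table) {
    // stop early with a clear message if the page has no such table
    throw new Error("Table 'sortable_table_id_0' not found on this page");
}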

var content = "";
var rows = table.rows;
for (var i = 0; i < rows.length; i++) {
    var cells = rows[i].cells;
    var cellArray = [];
    for (var j = 0; j < cells.length; j++) {
        var cellText = cells[j].textContent.replace(/^\s+|\s+$/g, ""); // trim
        cellArray.push(cellText);
    }
    content += cellArray.join(cellDelimiter) + rowDelimiter;
}

var out = window.open().document;
out.open("text/plain");
out.write(content);
out.close();
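
The conversion loop above can also be pulled into a function, which makes it easier to reuse on other tables and to try out without a browser. This is only a sketch: the name tableToText and its parameters are hypothetical, and it assumes the argument exposes rows, cells and textContent the way an HTML table element does.

```javascript
// Hypothetical helper (not part of the original script): converts a
// table-like object to delimited text, one line per row.
function tableToText(table, cellDelimiter, rowDelimiter) {
    if (!table) {
        throw new Error("table not found");
    }
    var content = "";
    var rows = table.rows;
    for (var i = 0; i < rows.length; i++) {
        var cells = rows[i].cells;
        var cellArray = [];
        for (var j = 0; j < cells.length; j++) {
            // strip leading and trailing whitespace from each cell
            cellArray.push(cells[j].textContent.replace(/^\s+|\s+$/g, ""));
        }
        // like the script above, every row ends with the row delimiter
        content += cellArray.join(cellDelimiter) + rowDelimiter;
    }
    return content;
}
```

On a Wikipedia page it could be called as tableToText(document.getElementById('sortable_table_id_0'), "|", "\n").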


Call this script with a bookmarklet. It is wrapped here for readability; save it in the bookmark as a single line.

javascript:void(s=document.createElement('script'));
void(s.src='http://localhost:8080/webapp/scripts/customscript.js');
void(document.body.appendChild(s));

Here the script is loaded from a server that runs locally. It could also be loaded from the local file system, but Firefox preferences must be changed to allow this.
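
Since the script can live at different addresses, the bookmarklet text can be generated rather than edited by hand. The following is a minimal sketch under that assumption; buildBookmarklet is a hypothetical helper name, and the generated code wraps the same three steps (create the script element, set its src, append it to the body) in a function so that no global variable is left behind.

```javascript
// Hypothetical helper: returns a single-line bookmarklet that injects
// the script found at scriptUrl into the current page.
function buildBookmarklet(scriptUrl) {
    var code =
        "(function(){" +
        "var s=document.createElement('script');" +
        // JSON.stringify quotes and escapes the URL as a string literal
        "s.src=" + JSON.stringify(scriptUrl) + ";" +
        "document.body.appendChild(s);" +
        "})();";
    return "javascript:" + code;
}
```

For example, buildBookmarklet('http://localhost:8080/webapp/scripts/customscript.js') gives a string that can be pasted into the bookmark's address field.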

Results for the "List of national capitals" page:

City|Country
Abu Dhabi|United Arab Emirates
Abuja|Nigeria
Accra|Ghana
Adamstown|Pitcairn Islands
Addis Ababa|Ethiopia
Algiers|Algeria
Alofi|Niue
Amman|Jordan
Amsterdam|Netherlands (official)
Andorra la Vella|Andorra
Ankara|Turkey
...

1 comment:

  1. Interesting points on extracting data. I use Python for simple HTML data extraction, but for larger projects like documents, the web, or files I tried extracting data from the web, which worked great; they build quick custom screen scrapers, data extraction, and data parsing programs.
