This will be a multi-part script that will:
1. record the project name and data fields
2. learn data locations via a web-based interactive script
3. retrieve data automatically
4. report any errors in the retrieval process
The scraped data will need to be incorporated into a MySQL database for data extraction by an existing website. New pages will be needed for this as a secondary project.
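To make the MySQL side concrete, here is a minimal sketch of one possible table, with one row per captured value. The connection details, table name, column names, and sizes are all placeholders, not a final schema:

```python
import mysql.connector  # assumes the mysql-connector-python package

# Hypothetical schema: one row per captured data value.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS scraped_data (
    website_id  INT          NOT NULL,
    field_name  VARCHAR(30)  NOT NULL,
    field_value VARCHAR(255),
    captured_at DATETIME     NOT NULL,
    error_msg   VARCHAR(30),
    PRIMARY KEY (website_id, field_name, captured_at)
)
"""

conn = mysql.connector.connect(
    host="localhost", user="scraper", password="secret", database="projects"
)
cur = conn.cursor()
cur.execute(CREATE_SQL)
conn.commit()
conn.close()
```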
## Deliverables
I need a complex web scraper built for me. I say complex because it will be required to pull data from many websites with different layouts. The first task for the winning bidder will be to create an input file of URLs from my existing database. Each of these URLs will be the home page of one of the sites we will be collecting data from. Creating this input file should be very simple, as my current website displays these URLs on one of my pages. The file will contain two pieces of data: the website's unique number and the URL of the website's home page.
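As a minimal sketch, the input file could be a simple CSV with one line per site; the file name, delimiter, and column order here are assumptions:

```python
import csv

# Assumed format: one line per site, "website_number,home_page_url"
# e.g.  000123,http://www.example-toystore.com/
def load_sites(path="sites.csv"):
    sites = []
    with open(path, newline="") as f:
        for number, url in csv.reader(f):
            sites.append((int(number), url.strip()))
    return sites
```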
There will be 4 parts to the actual scraper script.
The first part will work with a user to name a project and all of the data fields that will need to be captured. For the first project with the script you build, there might be 8 to 12 pieces of data to collect from each site, and they may reside on multiple pages. Each of these data fields will need to be given a unique name. So, I might call the project "toy prices" and the 8 data fields might be "mattel-truck", "Hess-truck", "dump-truck", etc.
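A minimal sketch of this setup step, assuming a simple command-line prompt and a JSON file for the project definition (both the prompt style and the file format are assumptions):

```python
import json

# Part one (sketch): prompt for a project name and its data field names,
# then save the definition for the later parts to read.
def define_project(path="project.json"):
    project = input("Project name: ").strip()
    fields = []
    while True:
        name = input("Data field name (blank to finish): ").strip()
        if not name:
            break
        fields.append(name)  # e.g. "mattel-truck", "Hess-truck", ...
    with open(path, "w") as f:
        json.dump({"project": project, "fields": fields}, f, indent=2)
```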
The second part of the script will work as a web-based interactive program. In this part of the script, each data field's location at every website in the input file (both the URL and the exact location on the page) will be recorded by the script with the help of a user. The script will start by reading the input file of URLs one at a time and display the home page of the first site in a work box on the user's screen. By "work box" I mean that part of the screen will be for the user to communicate with the script (like the header and left-hand column), while the rest of the screen shows the actual website. The user will then go through each of the data fields needed from this site one by one and define the URL and exact page location on the screen, so that the script can record this information for each field for later automatic retrieval in part three. To do this, the user must be able to change the URL (navigate from the home page) to get to the page where the data resides. The user will select each data field (perhaps they will all show in the left-hand column of the screen) one at a time and then highlight (select) the data field on the website. From the user's highlighting, the script must be able to record each data field's exact position, so that in the end:
For every data field at every website we want to collect data from, the script will learn and create a record.
The record layout will look something like this:
Positions:
1-6     website unique number
7-29    data-field-1-name
30-60   data-field-1-name-description (text/decimal/size)
61-90   data-field-1-name-url
91-119  data-field-1-name-page-location (starting row/column)
121-130 current date of data collection
131-140 exact time of data capture
141-150 data-field-1-data
151-180 error-message-if-any (blank if none)
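To make the layout concrete, here is a minimal sketch of packing one such fixed-width record from what the user highlighted; the widths follow the positions above, and the padding/truncation rules are assumptions:

```python
from datetime import datetime

# Fixed-width record writer (sketch). Widths follow the layout above.
def pack(value, width):
    return str(value)[:width].ljust(width)

def make_record(site_no, name, desc, url, location, data="", error=""):
    now = datetime.now()
    return (
        pack(site_no, 6)       # 1-6     website unique number
        + pack(name, 23)       # 7-29    field name
        + pack(desc, 31)       # 30-60   description (text/decimal/size)
        + pack(url, 30)        # 61-90   field url
        + pack(location, 29)   # 91-119  page location (row/column)
        + " "                  # position 120 is unused in the layout above
        + pack(now.strftime("%Y-%m-%d"), 10)  # 121-130 date of collection
        + pack(now.strftime("%H:%M:%S"), 10)  # 131-140 time of capture
        + pack(data, 10)       # 141-150 field data
        + pack(error, 30)      # 151-180 error message, blank if none
    )
```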
So, if there were 1,000 websites to collect data from and 8 pieces of data to collect from each, we should have 8,000 records in the project file showing the exact location of each piece of data and the data itself, along with any error message if the data could not be collected (i.e. the URL was no good, or the data was supposed to be decimal but the script found text, etc.). All 8,000 of these records will be recorded/written during the user-interactive second section of the script.
Also, with this file you can see how we could selectively go out and scrape the data for just one website, or go to every website and gather just data-field-2 from all of them, etc. It will be able to do this because in section three of the script, the automated retrieval of the data, it will first read an input record containing the information it needs to determine exactly what to do. This auto-update section will need to run as a cron-type job. The third section's auto-update control record will look something like this:
Position:
1-9   starting website number
10-20 ending website number
30    if position 30 has a 1 in it then get data-field-1; if it is 0, do not
31    if position 31 has a 1 in it then get data-field-2; if it is 0, do not
32    if position 32 has a 1 in it then get data-field-3; if it is 0, do not
33-38 likewise, one flag per position, all the way through data-field-8
From this record we can see that if the starting website number is 1, the ending number is equal to the last website, and all of the data-field characters are set to 1, then the script will go and retrieve all 8 data-fields from all 1,000 websites.
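A minimal sketch of how section three might read that control record and decide what to fetch; the control file name is a placeholder, fetch_field() is a hypothetical stand-in for the actual retrieval, and the flags are assumed to sit in positions 30-37 (the layout above lists through 38, but 8 flags fit in 30-37):

```python
# Part three (sketch): parse the fixed-width control record and run the
# selective scrape. Could be scheduled with cron, e.g.:
#   0 2 * * * /usr/bin/python3 auto_update.py
def parse_control_record(line):
    start_site = int(line[0:9])   # positions 1-9: starting website number
    end_site = int(line[9:20])    # positions 10-20: ending website number
    # One '1'/'0' flag per data field, assumed at positions 30-37.
    flags = [line[29 + i] == "1" for i in range(8)]
    return start_site, end_site, flags

def fetch_field(site_no, field_no):
    ...  # placeholder for the actual retrieval logic

def run_auto_update(control_path="control.rec"):
    with open(control_path) as f:
        start, end, flags = parse_control_record(f.readline())
    for site in range(start, end + 1):
        for field_no, wanted in enumerate(flags, start=1):
            if wanted:
                fetch_field(site, field_no)
```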
The fourth section of the script will be the exception reporting. During the auto-update cycle, any time the script encounters an error, a message should be written to the record as well as to an error report. This error report will describe the error as well as possible, so that a user can use section two of the script to correct the defined position for the field where the error was encountered.
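As a sketch of section four, an error logger could append a human-readable line to a report file while also returning a short message that fits the record's error slot; the report path and message format are assumptions:

```python
from datetime import datetime

# Part four (sketch): describe the failure well enough that a user can
# re-open section two and fix the recorded location for that field.
def report_error(site_no, field_name, url, reason, path="errors.txt"):
    with open(path, "a") as f:
        f.write(
            f"{datetime.now():%Y-%m-%d %H:%M:%S} "
            f"site={site_no} field={field_name} url={url} error={reason}\n"
        )
    return reason[:30]  # fits the 30-char error slot in the record

# e.g. report_error(123, "mattel-truck", "http://example.com/toys",
#                   "expected decimal, found text")
```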