This will be a multi-part script that will:
1. record the project name and data fields
2. learn data locations via a web-based interactive script
3. retrieve data automatically
4. report any errors in the retrieval process
The scraped data will need to be incorporated into a MySQL database for data extraction by an existing website. New pages will be needed for this as a secondary project.
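To make the MySQL side concrete, here is a minimal sketch of one possible table, with one row per captured value. The connection details, table name, column names, and sizes are all placeholders, not a final schema:

```python
import mysql.connector  # assumes the mysql-connector-python package

# Hypothetical schema: one row per captured data value.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS scraped_data (
    website_id  INT          NOT NULL,
    field_name  VARCHAR(30)  NOT NULL,
    field_value VARCHAR(255),
    captured_at DATETIME     NOT NULL,
    error_msg   VARCHAR(30),
    PRIMARY KEY (website_id, field_name, captured_at)
)
"""

conn = mysql.connector.connect(
    host="localhost", user="scraper", password="secret", database="projects"
)
cur = conn.cursor()
cur.execute(CREATE_SQL)
conn.commit()
conn.close()
```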
## Deliverables
I need a complex web scraper built for me. I say complex because it will be required to pull data from many websites with different layouts. The first task for the winning bidder will be to create an input file of URLs from my existing database. Each of these URLs will be the home page of one of the sites we will be collecting data from. Creating this input file should be very simple, as my current website displays these URLs on one of my pages. The file will contain two pieces of data: the website's unique number and the URL of the website's home page.
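As a minimal sketch, the input file could be a simple CSV with one line per site; the file name, delimiter, and column order here are assumptions:

```python
import csv

# Assumed format: one line per site, "website_number,home_page_url"
# e.g.  000123,http://www.example-toystore.com/
def load_sites(path="sites.csv"):
    sites = []
    with open(path, newline="") as f:
        for number, url in csv.reader(f):
            sites.append((int(number), url.strip()))
    return sites
```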
There will be 4 parts to the actual scraper script.
The first part will work with a user to name a project and all of the data fields that will need to be captured. For the first project with the script you build, there might be 8 to 12 pieces of data to collect from each site, and they may reside on multiple pages. Each of these data fields will need to be given a unique name. So, I might call the project "toy prices" and the 8 data fields might be "mattel-truck", "Hess-truck", "dump-truck", etc.
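A minimal sketch of this setup step, assuming a simple command-line prompt and a JSON file for the project definition (both the prompt style and the file format are assumptions):

```python
import json

# Part one (sketch): prompt for a project name and its data field names,
# then save the definition for the later parts to read.
def define_project(path="project.json"):
    project = input("Project name: ").strip()
    fields = []
    while True:
        name = input("Data field name (blank to finish): ").strip()
        if not name:
            break
        fields.append(name)  # e.g. "mattel-truck", "Hess-truck", ...
    with open(path, "w") as f:
        json.dump({"project": project, "fields": fields}, f, indent=2)
```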
The second part of the script will work as a web-based interactive program. In this part of the script, each data field's location at every website in the input file (both the URL and the exact location on the page) will be recorded by the script with the help of a user. The script will start by reading the input file of URLs one at a time and display the home page of the first site in a work box on the user's screen. By "work box" I mean that part of the screen will be for the user to communicate with the script (like the header and left-hand column), while the rest of the screen shows the actual website. The user will then go through each of the data fields needed from this site one by one and define the URL and exact page location on the screen, so that the script can record this information for each field for later automatic retrieval in part three. To do this, the user must be able to change the URL (navigate from the home page) to get to the page where the data resides. The user will select each data field (perhaps they will all show in the left-hand column of the screen) one at a time and then highlight (select) the data field on the website. From the user's highlighting, the script must be able to record each data field's exact position, so that in the end:
For every data field at every website we want to collect data from, the script will learn and create a record.
The record layout will look something like this:
Positions:
1-6     website unique number
7-29    data-field-1-name
30-60   data-field-1-name-description (text/decimal/size)
61-90   data-field-1-name-url
91-119  data-field-1-name-page-location (starting row/column)
121-130 current date of data collection
131-140 exact time of data capture
141-150 data-field-1-data
151-180 error-message-if-any (blank if none)
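To make the layout concrete, here is a minimal sketch of packing one such fixed-width record from what the user highlighted; the widths follow the positions above, and the padding/truncation rules are assumptions:

```python
from datetime import datetime

# Fixed-width record writer (sketch). Widths follow the layout above.
def pack(value, width):
    return str(value)[:width].ljust(width)

def make_record(site_no, name, desc, url, location, data="", error=""):
    now = datetime.now()
    return (
        pack(site_no, 6)       # 1-6     website unique number
        + pack(name, 23)       # 7-29    field name
        + pack(desc, 31)       # 30-60   description (text/decimal/size)
        + pack(url, 30)        # 61-90   field url
        + pack(location, 29)   # 91-119  page location (row/column)
        + " "                  # position 120 is unused in the layout above
        + pack(now.strftime("%Y-%m-%d"), 10)  # 121-130 date of collection
        + pack(now.strftime("%H:%M:%S"), 10)  # 131-140 time of capture
        + pack(data, 10)       # 141-150 field data
        + pack(error, 30)      # 151-180 error message, blank if none
    )
```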
So, if there were 1,000 websites to collect data from and 8 pieces of data to collect from each, we should have 8,000 records in the project file showing the exact location of each piece of data and the data itself, along with any error message if the data could not be collected (i.e. the URL was no good, or the data was supposed to be decimal but the script found text, etc.). All 8,000 of these records will be recorded/written during the user-interactive second section of the script.
Also, with this file you can see how we could selectively go out and scrape the data for just one website, or go to every website and gather just data-field-2 from all of them, etc. It will be able to do this because in section three of the script, the automated retrieval of the data, it will first read an input record containing the information it needs to determine exactly what to do. This auto-update section will need to run as a cron-type job. The third section's auto-update control record will look something like this:
Position:
1-9   starting website number
10-20 ending website number
30    if position 30 has a 1 in it then get data-field-1; if it is 0, do not
31    if position 31 has a 1 in it then get data-field-2; if it is 0, do not
32    if position 32 has a 1 in it then get data-field-3; if it is 0, do not
33-38 likewise, one flag per position, all the way through data-field-8
From this record we can see that if the starting website number is 1, the ending number is equal to the last website, and all of the data-field characters are set to 1, then the script will go and retrieve all 8 data-fields from all 1,000 websites.
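A minimal sketch of how section three might read that control record and decide what to fetch; the control file name is a placeholder, fetch_field() is a hypothetical stand-in for the actual retrieval, and the flags are assumed to sit in positions 30-37 (the layout above lists through 38, but 8 flags fit in 30-37):

```python
# Part three (sketch): parse the fixed-width control record and run the
# selective scrape. Could be scheduled with cron, e.g.:
#   0 2 * * * /usr/bin/python3 auto_update.py
def parse_control_record(line):
    start_site = int(line[0:9])   # positions 1-9: starting website number
    end_site = int(line[9:20])    # positions 10-20: ending website number
    # One '1'/'0' flag per data field, assumed at positions 30-37.
    flags = [line[29 + i] == "1" for i in range(8)]
    return start_site, end_site, flags

def fetch_field(site_no, field_no):
    ...  # placeholder for the actual retrieval logic

def run_auto_update(control_path="control.rec"):
    with open(control_path) as f:
        start, end, flags = parse_control_record(f.readline())
    for site in range(start, end + 1):
        for field_no, wanted in enumerate(flags, start=1):
            if wanted:
                fetch_field(site, field_no)
```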
The fourth section of the script will be the exception reporting. During the auto-update cycle, any time the script encounters an error, a message should be written to the record as well as to an error report. This error report will describe the error as well as possible, so that a user can use section two of the script to correct the defined position for the field where the error was encountered.
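As a sketch of section four, an error logger could append a human-readable line to a report file while also returning a short message that fits the record's error slot; the report path and message format are assumptions:

```python
from datetime import datetime

# Part four (sketch): describe the failure well enough that a user can
# re-open section two and fix the recorded location for that field.
def report_error(site_no, field_name, url, reason, path="errors.txt"):
    with open(path, "a") as f:
        f.write(
            f"{datetime.now():%Y-%m-%d %H:%M:%S} "
            f"site={site_no} field={field_name} url={url} error={reason}\n"
        )
    return reason[:30]  # fits the 30-char error slot in the record

# e.g. report_error(123, "mattel-truck", "http://example.com/toys",
#                   "expected decimal, found text")
```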