Scrape 500K web pages

Closed. Posted Nov 12, 2007. Paid on delivery.

We are reposting this project because our previous coder was unable to complete it in time. Please bid only if you are confident you can complete the entire project within one week. Note that this project requires running multiple instances, each using a different IP address. In your bid, please briefly describe how you plan to complete the project.

We need a developer to scrape approximately 530,000 web pages cached by a search engine. The developer will go through 26 different search terms, parse each result page for specific URLs, and then scrape a cached copy of the actual page. For each cached page, the developer will extract certain pieces of content and insert them into our MySQL database using the provided schema. The specific content is conveniently annotated with CSS, so the developer can easily use XPath or simple regular expressions to parse it. The content being scraped is in the public domain.

The project (including the scrape) should be completed in one week. You should run at least 20 instances/threads in parallel, each using a different IP address. When you restart a thread, it must use a different IP address. At 2 seconds per page, we expect the scrape to take less than 1.5 days.

It is important to note that the search engine will likely limit the scraper to 500 to 5,000 requests per IP address per 24-hour period. Doing the math: for 500,000 pages and a 24-hour limit of 1,000 requests per IP address, this equates to 500 different IP addresses. Of course, if you spread the task over four days, you need only 125 different IP addresses.

We will pay for pre-approved server and bandwidth usage. You are free to use your own servers or a virtual server such as EC2. We are somewhat platform agnostic, but being Rails and Java guys, we have a definite preference for solutions in one of the two. For each milestone, the developer will send a snapshot of the latest codebase, and we will validate and sign off on the scraped data.
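The capacity figures in the brief can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names `ips_needed` and `scrape_hours` are made up for this example, and it assumes every worker runs continuously at the quoted 2 seconds per page.

```python
def ips_needed(pages, requests_per_ip_per_day, days=1):
    """IP addresses needed to fetch `pages` within `days` under a per-IP daily cap."""
    capacity_per_ip = requests_per_ip_per_day * days
    # Integer ceiling division, so a partial batch still counts as one more IP.
    return -(-pages // capacity_per_ip)


def scrape_hours(pages, seconds_per_page, workers):
    """Wall-clock hours if `workers` fetch in parallel without ever pausing."""
    return pages * seconds_per_page / workers / 3600


if __name__ == "__main__":
    print(ips_needed(500_000, 1000))               # 500 IPs for a one-day scrape
    print(ips_needed(500_000, 1000, days=4))       # 125 IPs over four days
    print(ips_needed(500_000, 500))                # 1000 IPs at the low end of the cap
    print(ips_needed(500_000, 5000))               # 100 IPs at the high end of the cap
    print(round(scrape_hours(530_000, 2, 20), 1))  # 14.7 hours with 20 workers
```

The 14.7-hour figure only holds if enough IP addresses are available to keep all 20 workers under the per-IP cap. With a 1,000-request daily limit, a single IP is exhausted after roughly 33 minutes at 2 seconds per page.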

## Deliverables

There are three milestones, and you will be paid a partial fee for completing each one. For each milestone, the developer will send a snapshot of the latest codebase and instructions to execute the code, and we will validate and sign off on the scraped data. A milestone will not be considered complete until we have validated the accuracy of the data. All three milestones must be completed within one week.

1. Develop the search results scraper and scrape all the cached_profile_urls for the first two search terms. (5%)
2. Develop the profile scraper and scrape the 20 URLs we provide. (5%)
3. Scrape each of the profile_cached_urls in the search_results table. (90%)
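As a rough illustration of the profile scraper in milestone 2, the sketch below pulls the text out of elements tagged with particular CSS class names, using only Python's standard library. The class names (`profile-name`, `profile-location`), the sample HTML, and the helper `extract_fields` are all hypothetical. The real markup and the MySQL schema are not given in this posting, and the brief itself suggests XPath or regular expressions as equally valid approaches.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text found inside elements carrying one of the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.chunks = {}   # class name -> list of text fragments
        self._open = []    # stack of (tag, matched class or None)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self._open.append((tag, match))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed children.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        for _, match in self._open:
            if match:
                self.chunks.setdefault(match, []).append(data)


def extract_fields(html, wanted_classes):
    """Return {class name: whitespace-normalized text} for one page."""
    parser = ClassTextExtractor(wanted_classes)
    parser.feed(html)
    parser.close()
    return {name: " ".join("".join(parts).split())
            for name, parts in parser.chunks.items()}


# Hypothetical cached page; the real pages will use different markup.
SAMPLE = """
<div class="profile">
  <h1 class="profile-name">Ada <b>Lovelace</b></h1>
  <span class="profile-location highlight">London,<br> UK</span>
  <p class="bio">Not requested.</p>
</div>
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE, ["profile-name", "profile-location"]))
```

Each key in the returned dictionary could then map to a column in the provided schema. Running the extractor against the 20 sample URLs first is a cheap way to confirm the class names before starting the full scrape.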

## Platform

Rails or Java preferred.

Engineering, Java, MySQL, Perl, PHP, Python, Ruby on Rails, Software Architecture, Software Testing, XML, XSLT

Project ID: #3467529

About the project

10 bids. Remote project. Active Nov 28, 2007.

10 freelancers are bidding an average of $808 for this job.

| Freelancer | Proposal | Bid | Reviews | Score |
|---|---|---|---|---|
| kishil | See private message. | $850 USD in 7 days | 30 | 5.8 |
| Sefidel | See private message. | $850 USD in 7 days | 15 | 7.0 |
| meteorindia | See private message. | $850 USD in 7 days | 4 | 4.7 |
| sharkinfo2004 | See private message. | $841.5 USD in 7 days | 6 | 3.6 |
| huyvtrany2k9 | See private message. | $850 USD in 7 days | 4 | 5.2 |
| vw6742929vw | See private message. | $850 USD in 7 days | 7 | 3.6 |
| sergejv | See private message. | $850 USD in 7 days | 3 | 2.7 |
| jwbvw | See private message. | $850 USD in 7 days | 3 | 3.5 |
| hutsolvw | See private message. | $765 USD in 7 days | 1 | 3.2 |
| javajia | See private message. | $525.3 USD in 7 days | 0 | 0.0 |