Need HTML spider to catalog and convert video gallery pages into CSV/ZIP (repost)

In progress. Posted Aug 4, 2010. Paid on delivery.

Web application with a simple one-page interface. I'm open to this being built in Perl or PHP 5.2.

Input: A list of URLs of thumbnail/movie gallery HTML pages which have textual, graphical, image, and video content in various formats and layouts.

Output: A CSV dump file which includes the Title, Description, Video thumbnail filename, Video filename (H.264 or FLV format), Video duration (minutes:seconds), Video height (pixels), Video width (pixels) AND a ZIP file including the video and thumbnail files.
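As a rough illustration of the CSV layout described above, here is a minimal Python sketch (Python is used only for illustration; the project itself is to be built in Perl or PHP). The header spellings in `COLUMNS` and the helper names `format_duration` and `write_catalog` are assumptions, not part of the brief.

```python
import csv
import io

# Column names follow the Output paragraph; the exact header spelling is an assumption.
COLUMNS = [
    "Title", "Description", "Thumbnail filename", "Video filename",
    "Duration", "Height", "Width",
]


def format_duration(total_seconds):
    """Render a duration given in seconds as minutes:seconds."""
    minutes, seconds = divmod(int(round(total_seconds)), 60)
    return f"{minutes}:{seconds:02d}"


def write_catalog(rows, fh):
    """Write a header row, then one CSV line per video clip."""
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    buf = io.StringIO()
    write_catalog([{
        "Title": "Sample Gallery #1",
        "Description": "Two sentences of page text. Collected by the spider.",
        "Thumbnail filename": "Sample_Gallery_1.jpg",
        "Video filename": "Sample_Gallery_1.mp4",
        "Duration": format_duration(125),
        "Height": 480,
        "Width": 640,
    }], buf)
    print(buf.getvalue())
```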

Magic:

1) The spider will need to navigate each page and catalog all direct links to video files, as well as any video files hosted on the page itself.

1a) The spider will need to forge the referrer (the HTTP Referer header) when requesting images and videos, to get around web server anti-leeching (hotlink protection) restrictions.

2) If the video file is in WMV format, convert it to H.264 MP4 and save a local copy for inclusion in ZIP file. If the video file is already in H.264 MP4 or Flash Video (FLV) format, just save a local copy.

3) If the video file is linked directly from the gallery page, and the link is a linked image, save that image as the thumbnail for that video.

4) Determine the height and width of the video file and resize the image thumbnail to the same dimensions.

5) If there is no image thumbnail (see #3), create a thumbnail (with the same height/width as the video; jpeg format) from a random frame of the video file.

6) Determine the duration of the video clip and save it for export in the resulting CSV file.

7) Save the title of the gallery page as the title of the video clip. If there are multiple videos on a single gallery page, append "#1", "#2", etc. to the end of the title.

7a) Rename the video file and the video thumbnail file to correspond to this same title format (to prevent filename collisions with past or future videos).

8) Catalog all text that is at least 2 sentences long for export as the description of the video clips on the page.

9) Use a predefined list of 'stop' words to disqualify certain sentences from being included in Description collation.

10) Export a CSV file with one line per video clip, containing all the information described in the Output section above. There may be multiple entries if there are multiple videos hosted on or linked from the gallery page.

11) Export a ZIP file which includes all of the thumbnails and MP4/FLV video files.
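To make steps 1, 1a and 3 concrete, here is a minimal Python sketch (illustration only; the deliverable is expected in Perl or PHP). It collects direct links to video files, records an image wrapped by the same link as the thumbnail candidate, and builds a request carrying a Referer header. The class name `GalleryParser`, the helper `build_request`, and the extension list are assumptions; embedded players are not handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request

# Assumed set of extensions, taken from the formats named in the brief.
VIDEO_EXTENSIONS = (".wmv", ".mp4", ".flv")


class GalleryParser(HTMLParser):
    """Collect direct video links and the image (if any) that links to each."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.videos = []      # dicts: {"video": url, "thumbnail": url or None}
        self._current = None  # entry for the video link we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            path = href.split("?", 1)[0].lower()
            if path.endswith(VIDEO_EXTENSIONS):
                self._current = {
                    "video": urljoin(self.page_url, href),
                    "thumbnail": None,
                }
                self.videos.append(self._current)
            else:
                self._current = None
        elif tag == "img" and self._current is not None:
            src = attrs.get("src")
            if src and self._current["thumbnail"] is None:
                # Step 3: an image inside the video link becomes the thumbnail.
                self._current["thumbnail"] = urljoin(self.page_url, src)

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None


def build_request(url, referer):
    """Step 1a: build a request whose Referer header names the gallery page."""
    return Request(url, headers={"Referer": referer})
```

A caller would feed the fetched page source to `GalleryParser(page_url).feed(html)` and then pass each `build_request(...)` result to `urllib.request.urlopen`.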
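For the conversion in step 2 and the final ZIP export, one possible approach is to shell out to ffmpeg and package the results with a ZIP library. The Python sketch below assumes ffmpeg is installed and built with libx264 and an AAC encoder; the brief does not name a conversion tool, so that choice and the helper names `conversion_command` and `build_zip` are assumptions.

```python
import os
import zipfile


def conversion_command(source_path):
    """Return (argv, output_path); argv is None when no conversion is needed."""
    stem, ext = os.path.splitext(source_path)
    if ext.lower() != ".wmv":
        # MP4 and FLV files are kept as they are (step 2).
        return None, source_path
    output_path = stem + ".mp4"
    # Assumes an ffmpeg build that includes the libx264 and aac encoders.
    argv = ["ffmpeg", "-i", source_path,
            "-c:v", "libx264", "-c:a", "aac", output_path]
    return argv, output_path


def build_zip(zip_path, file_paths):
    """Store videos and thumbnails flat in one archive, without recompressing."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for path in file_paths:
            archive.write(path, arcname=os.path.basename(path))
```

The returned `argv` list can be passed to `subprocess.run(argv, check=True)`. `ZIP_STORED` is used because video and JPEG data are already compressed.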
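Steps 4 and 6 need the video's width, height and duration. One way to obtain them, assuming ffprobe (shipped with ffmpeg) is available, is to request JSON output and parse it. This Python sketch is an illustration under that assumption; `probe_command` and `parse_probe_output` are hypothetical helper names.

```python
import json


def probe_command(video_path):
    """Build an ffprobe call that reports size and duration as JSON."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=width,height:format=duration",
            "-of", "json", video_path]


def parse_probe_output(raw_json):
    """Extract width, height and duration (in seconds) from ffprobe JSON."""
    data = json.loads(raw_json)
    stream = data["streams"][0]
    return {
        "width": int(stream["width"]),
        "height": int(stream["height"]),
        # ffprobe reports the duration as a decimal string.
        "duration_seconds": float(data["format"]["duration"]),
    }
```

The command's standard output, captured for example with `subprocess.run(..., capture_output=True, text=True)`, is what `parse_probe_output` expects.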
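The titling and description rules (steps 7, 7a, 8 and the stop-word filter) can be sketched as follows in Python. The sentence splitter is deliberately naive, and treating a stop word as a whole-word, case-insensitive match is an assumption, as are all the function names.

```python
import re


def numbered_titles(page_title, video_count):
    """Step 7: one title per video, with '#1', '#2', ... when there are several."""
    title = page_title.strip()
    if video_count <= 1:
        return [title] * video_count
    return [f"{title} #{i}" for i in range(1, video_count + 1)]


def safe_filename(title, extension):
    """Step 7a: derive a filename from the title (non-alphanumerics become '_')."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return f"{stem}.{extension}"


def split_sentences(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_description(text_blocks, stop_words):
    """Step 8 plus the stop-word filter: keep blocks of at least 2 sentences,
    then drop any sentence that contains a stop word."""
    stop = {word.lower() for word in stop_words}
    kept = []
    for block in text_blocks:
        sentences = split_sentences(block)
        if len(sentences) < 2:
            continue
        for sentence in sentences:
            words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
            if not words & stop:
                kept.append(sentence)
    return " ".join(kept)
```

For example, a gallery titled "Beach Party" with two videos would yield the titles "Beach Party #1" and "Beach Party #2", and `safe_filename` would map the second to `Beach_Party_2.mp4`.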

## Deliverables

You will be required to develop and test this software with pages which contain adult content. You must accept this provision to be selected for this project.

Gallery pages will be provided to use during development and test phases.

Here are some examples of the gallery pages:

[login to view URL]

[login to view URL]

[login to view URL]:revtc:bflc,0

[login to view URL]

[login to view URL]|Uniformed

[login to view URL],1456,2,1,0

[login to view URL]

http://www.sapphic-erotica.com/VGMwMDc3fDc2Mjk=/MjQwMy4zLjEuMS4wLjAuMC4wLjA

[login to view URL]

[login to view URL]

Skills: Amazon Web Services, Engineering, Linux, Perl, PHP, Project Management, Software Architecture, Software Testing, UNIX, XML

Project ID: #3624338

About the project

4 proposals. Remote project. Active Aug 9, 2010.

4 freelancers are bidding an average of $978 for this job

akkiniraj

See private message.

$297.5 USD in 14 days
(87 reviews)
6.5
webexpert78

See private message.

$212.5 USD in 14 days
(104 reviews)
6.1
trkr

See private message.

$2975 USD in 14 days
(67 reviews)
5.8
erpoojasharma

See private message.

$425 USD in 14 days
(1 review)
0.0