Building a search engine (database problem)

We are building a special application that uses a web crawler and an MS SQL Server database.

We want to use the crawler to index about 300 web sites locally,
and then run searches locally against our database.

The problem is that we think either the database is not powerful enough or we need to redesign its structure.

Let's say we crawl 300 sites and save the following data to the database:

1) The link
2) The title
3) The entire HTML source (this uses a lot of hard disk space)

We have noticed that 1,000,000 pages with HTML source
take about 6 GB of hard disk space on the MS SQL server.

We can use very big hard disks, but the DB programmer told me that we will never manage to run queries on the HTML source because there is too much data.

(Example: I want to get all the links whose HTML source contains this string: "<a href="#">text</a></li>")
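
To make the example concrete, here is a minimal sketch of the kind of query I mean. It uses Python's built-in sqlite3 module purely as a stand-in for our MS SQL database, and the table name, column names (pages, link, title, html), the helper function and the sample rows are all made up for illustration:

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MS SQL server.
# Table and column names are hypothetical, chosen to mirror the three
# fields listed above: link, title, HTML source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (link TEXT PRIMARY KEY, title TEXT, html TEXT)")
conn.executemany(
    "INSERT INTO pages (link, title, html) VALUES (?, ?, ?)",
    [
        ("http://example.com/a", "Page A", '<ul><li><a href="#">text</a></li></ul>'),
        ("http://example.com/b", "Page B", "<p>no matching markup here</p>"),
    ],
)


def find_links_containing(conn, fragment):
    """Return the links of all pages whose stored HTML contains `fragment`."""
    # instr() does a plain substring match, so characters such as % or _
    # in the fragment are not treated as wildcards (as they would be with LIKE).
    rows = conn.execute(
        "SELECT link FROM pages WHERE instr(html, ?) > 0 ORDER BY link",
        (fragment,),
    )
    return [row[0] for row in rows]


matches = find_links_containing(conn, '<a href="#">text</a></li>')
print(matches)  # ['http://example.com/a']
```

A query like this has to read the HTML of every row, which I assume is what the DB programmer is worried about once there are 1,000,000 pages.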

My point is that surely we can find a way to index 300 sites locally
with a web crawler and find a smart way to run queries on the HTML source data.

Google can do it for millions of sites; we want to do it for 300.


Any tips or suggestions?


Thanks

Ami.