Weakness of Lucene -integ(空谷清音) @卡尔加里华人论坛:卡尔加里枫下论坛 The Rolia Forum of Calgary

Weakness of Lucene

integ(空谷清音)

Thanks. But Lucene (Core) :

1) lacks transaction support,

That is, its index can be corrupted if, for example, an indexing machine goes down for whatever reason (loss of power, human error, etc.) while indexing. And those failures do happen in real life!

Although there were several "Directory" implementations (SQLDirectory, DBDirectory, JDBCDirectory) distributed in the Sandbox in the past, poor performance and scalability rendered them useless in a serious production environment. As such, they have all been removed from the official Sandbox as far as I know.

The Compass framework on top of Lucene seems better than the above implementations, but performance is still the major concern.

2) is not well suited to running in a distributed computing environment.
......

But it should be a good idea to use Lucene as a starting point. Is there anyone here who has done some research on Lucene and its source code, so that we can discuss it?
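
To illustrate point 1 above, here is a minimal sketch of one common way to keep a file from being left half-written when a machine dies mid-update: write to a temporary file, flush it to disk, then rename it over the old file. This is a general technique and not Lucene's own mechanism; the function name `atomic_write` and the file name used in the usage note are hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` so readers see the old or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # the rename is atomic on POSIX filesystems
    except BaseException:
        # On any failure, drop the partial temp file; the original is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling `atomic_write("segments.bin", b"...")` twice leaves only the second version on disk. A crash before the rename leaves the first version intact, although a stray temp file may remain. This protects a single file only; it is not a substitute for real transaction support across many files.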

(#2625948@0)
Last Updated: 2005-11-24

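I've never built a search engine; that's too advanced for me. But I have played around on my own: I wrote a program that downloads all of a website's HTML as one giant string and parses out the email addresses, the various links, and so on. If the site is XHTML it gets much easier, since XSLT can pull out the nodes of interest directly -binghongcha76(一只大猫); 2005-11-21 {112} (#2619652@0)

I'd guess Google and MSN Search work on a similar principle, periodically using this kind of web robot to crawl all the links in a site. What does the original poster think? I'd be glad to hear it.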
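
The "one giant string" approach described in this post can be sketched roughly as follows. The two regular expressions are deliberately simplified assumptions (real email addresses and `href` attributes have many more forms), and `extract_emails_and_links` is a hypothetical name. The page is passed in as a string, so the download step is left out.

```python
import re

# Simplified patterns; they will miss unusual addresses and unquoted attributes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def extract_emails_and_links(page: str):
    """Return (emails, links) found in the raw page text, de-duplicated and sorted."""
    emails = sorted(set(EMAIL_RE.findall(page)))
    links = sorted(set(HREF_RE.findall(page)))
    return emails, links
```

For example, a page containing `<a href="http://example.com/a">` and the text `alice@example.org` yields that address in the first list and that URL in the second.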

I am new to XML/XSLT. From what I know, XSLT is for translating XML to HTML. Can it also manipulate HTML/XHTML? Is XHTML an XML-compatible form of HTML? Thanks. -647i(-); 2005-11-21 (#2620864@0)

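That's right; to be more precise, it uses XPath/XSLT. XHTML is HTML written as standard XML, so an XHTML page is itself an XML file that conforms to the W3C standard. That means I can use XPath to find the nodes I'm interested in. Once I've found them, I use XSLT to turn those nodes into another XML format of my own -binghongcha76(一只大猫); 2005-11-22 {90} (#2621310@0)
These steps all happen automatically during the XSLT transformation, because the transformation itself uses XPath to locate the nodes inside the XHTML.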
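
As a rough illustration of finding nodes in XHTML with XPath, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and no XSLT. The function name `find_link_targets` and the prefix `x` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so the XPath expression must use it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def find_link_targets(xhtml: str):
    """Return the href of every <a> element that has one, in document order."""
    root = ET.fromstring(xhtml)  # works because well-formed XHTML is XML
    return [a.get("href") for a in root.findall(".//x:a[@href]", XHTML_NS)]
```

The expression `.//x:a[@href]` selects every `a` element with an `href` attribute anywhere below the root, which is the kind of node selection the post describes.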

How about DOM? -647i(-); 2005-11-22 (#2621317@0)
You mean XHTML can be transformed to HTML by XSLT/XPath. How do I translate HTML to XML/XHTML? Is there any translation tool or validation tool? -647i(-); 2005-11-22 (#2621319@0)

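As I understand it, the DOM is a container: reading an XML file into a DOM produces a tree-structured object that you can then operate on, including querying it with XPath and adding, modifying, or deleting nodes. All of this is easy to do in JavaScript or .NET -binghongcha76(一只大猫); 2005-11-22 (#2621327@0)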
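
The add, modify, and delete operations mentioned here might look like this with Python's standard `xml.dom.minidom` (used instead of JavaScript or .NET only to keep the examples in one language). The `catalog`/`item` document shape and the name `edit_catalog` are made up for the illustration.

```python
from xml.dom import minidom


def edit_catalog(xml_text: str) -> str:
    """Load XML into a DOM tree, then add, modify, and delete nodes."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement

    # Add: append a new <item id="3">gamma</item> under the root.
    new_item = doc.createElement("item")
    new_item.setAttribute("id", "3")
    new_item.appendChild(doc.createTextNode("gamma"))
    root.appendChild(new_item)

    # Modify: change the text of the first <item>.
    first = root.getElementsByTagName("item")[0]
    first.firstChild.data = "ALPHA"

    # Delete: remove the second <item>.
    second = root.getElementsByTagName("item")[1]
    root.removeChild(second)

    return root.toxml()
```

Given `<catalog><item id="1">alpha</item><item id="2">beta</item></catalog>`, the result keeps item 1 with new text, drops item 2, and gains item 3.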

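http://www.toronto1.biz/xml At that address there is an XML DOM manipulation SDK that I wrote. It supports DTD validation, DOM operations, and Chinese GB2312 encoding. It is written in C and comes with a reference manual. -walacato(walacato); 2005-11-22 (#2622454@0)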

I had a look; it's nice. Do you work at that NetSoft company? -binghongcha76(一只大猫); 2005-11-22 (#2622968@0)

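In theory you can of course turn XHTML into ordinary HTML, but I don't see the point. As for converting HTML to XHTML with XSLT, I don't know how to do it either, because HTML files contain tags that don't conform to the XML standard, such as <br>; when the XSLT parser tries to read such non-well-formed XML, it reports an error -binghongcha76(一只大猫); 2005-11-22 {191} (#2621328@0)
Conversion tools certainly exist, but I haven't used them. I only played with XHTML out of curiosity back then and never dug into it. You could try typing how to convert HTML to XHTML into Google

With the latest VS2005, all web sites already use XHTML by default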
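
The `<br>` problem described in this post is easy to reproduce with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` rather than an XSLT processor, and `is_well_formed_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

`is_well_formed_xml("<p>one<br>two</p>")` is false because the `<br>` is never closed, while the XHTML form `<p>one<br/>two</p>` parses fine.
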
Use TIDY; it can convert any HTML to XHTML. -schen(睹往睹来.非赌徒也!); 2005-11-22 (#2621816@0)

XSLT is not just for converting XML to HTML; it is a markup language for transforming one XML format into any other format (text, XML, HTML, etc.) -binghongcha76(一只大猫); 2005-11-22 (#2621314@0)

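This is called a CRAWLER or SPIDER. Fault tolerance is the key to building one: whether the input is HTML/XHTML or complete garbage (many sites' HTML is written any old way), you must parse it correctly, or at least not CRASH. Some sites also generate links dynamically with JS, so you have to INVOKE the JS to get the result -google2002(Google); 2005-11-23 (#2624512@0)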
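
A minimal sketch of the fault-tolerance point, assuming Python's standard `html.parser`, which does not require well-formed input. The names `LinkCollector` and `extract_links` are hypothetical, and links generated by JavaScript are out of scope here because nothing executes scripts.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, even in badly formed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from the parser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str):
    """Return every link found; never raise on malformed markup."""
    collector = LinkCollector()
    try:
        collector.feed(html)
        collector.close()
    except Exception:
        pass  # keep whatever was collected before the failure
    return collector.links
```

Unclosed tags, stray closing tags, and unquoted attribute values do not stop the collector; it simply reports the links it could recognize.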

PM sent. -xfile(猪博士◎Joobs); 2005-11-21 (#2620190@0)

Thank everyone for your replies! -integ(空谷清音); 2005-11-21 (#2620561@0)

MARK -647i(-); 2005-11-21 (#2620704@0)

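It mainly depends on scale -benlin(默默向上游); 2005-11-22 {106} (#2620962@0)

I played with Lucene and Nutch for a while. Getting the functionality working is not a big problem, but handling large volumes of pages and large volumes of queries takes a lot of care.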

Bandwidth is a matter of money. Internally I use -integ(空谷清音); 2005-11-22 {213} (#2622440@0)
Apache HTTP (acting only as a load dispatcher) +
Clustered application servers with session affinity & failover enabled +
(Parallel) database server
Scalability should not be a problem. We can add many more app servers and DB servers.
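
The dispatcher role in this setup can be sketched as round-robin selection that skips servers marked as down. This is only an illustration of the idea, not how Apache or any particular application server implements it; the class name and the server names are hypothetical, and session affinity is left out.

```python
from itertools import cycle


class RoundRobinDispatcher:
    """Hands out application servers in turn, skipping ones marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call before giving up.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy application servers")
```

With servers `app1`, `app2`, `app3`, requests rotate through all three; after `mark_down("app2")` they alternate between the remaining two. Adding capacity means adding names to the list, which is the scaling argument made in the post.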

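What you have is not a search engine at all; at most it is an ordinary website running RDBMS TEXT SEARCH in the back end, with no real technical content. In a good search engine, everything from the front-end HTTP SERVER to the WEB SPIDER to the text PARSER and DISPATCHER should be written in-house, and for now that can only be done in C/C++ -google2002(Google); 2005-11-23 (#2624503@0)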

Not necessarily. Java has Lucene, though it is aimed at the small-to-medium enterprise level... -bjrenzx1(机器卡); 2005-11-23 (#2624571@0)

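Heh, in my mind only Internet-facing services like GOOGLE and BAIDU count as search engines. If you write your own lightweight HTTP SERVER in C, a single PC can sustain 10,000 concurrent connections; I doubt JAVA can -google2002(Google); 2005-11-23 (#2624627@0)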

True, but they won't disclose their patented algorithms, architecture, and so on... unlike Lucene, which is out in the open -bjrenzx1(机器卡); 2005-11-24 (#2624719@0)

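If I designed a search engine now, my word segmentation algorithm and RANK would not match Google's, but my distributed design would not be much worse than theirs -google_abcd(-1); 2005-11-24 (#2625882@0)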

You mean rewriting distributed transaction services, failover, and load balancing across nodes in C++ from scratch, including the communication protocols on top of TCP/IP? If so, you are really somebody. -flipper_duckball(忘带枪的战士); 2005-11-25 (#2627718@0)

I do this stuff every day. Actually, once you get involved in one such project you won't think it is very hard (however, I am not able to implement distributed transactions; too hard). -googleabcd(古狗); 2005-11-25 {173} (#2627843@0)
In fact, I was doing the same thing in China before. That is why I got a good job at one of the best IT companies in Canada just 3 months after I landed here last year.

Can you tell me about any commercial products you are involved in? -flipper_duckball(忘带枪的战士); 2005-11-25 {269} (#2627874@0)
Actually, I'm interested in the core implementation of failover and load balancing. Do you use a third-party product to implement distributed replicas, some group technology with distributed objects like JGroup, or do you write it yourself based on sockets and multicast?

We don't use any third-party library in our load-balancing/failover design. All the code is written by ourselves using C/C++, sockets, and broadcast on Unix/Linux platforms -googleabcd(古狗); 2005-11-26 (#2628666@0)

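That's not right. I just learned this at work today: if you apply for a patent, the algorithm is published; others just aren't allowed to use it. -naug(xiaoxiao); 2005-12-2 (#2641012@0)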

You're the real expert! -integ(空谷清音); 2005-11-24 {104} (#2626046@0)
If you had named yourself Google1995, then Google's bosses wouldn't be Sergey Brin and Larry Page, but you. Just kidding.

I am developing a small search engine in my spare time. But it is just for fun :) -googleabcd(古狗); 2005-11-25 (#2627845@0)

The most impressive thing about Google is not the search engine -benlin(默默向上游); 2005-11-25 {231} (#2626968@0)
but their internal file system.
http://labs.google.com/papers/gfs-sosp2003.pdf

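They use cheap PCs, clustered together for distributed parallel computing; whenever a node fails, its tasks can be switched to another node.

Search for "google file system" to find more material.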
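
The idea of switching a task to another node when one fails can be sketched generically as below. This is not the Google File System design, only a toy illustration; `run_with_failover`, the `task` callable, and the node names are all hypothetical.

```python
def run_with_failover(task, nodes):
    """Try the task on each node in order; return (node, result) from the first success."""
    errors = []
    for node in nodes:
        try:
            return node, task(node)
        except Exception as exc:
            # This node failed; remember why and move on to the next one.
            errors.append((node, exc))
    raise RuntimeError(f"task failed on all nodes: {errors}")
```

If `task` raises for the first node and succeeds on the second, the caller gets the second node's result and never sees the failure. A real cluster would also need failure detection, retries with backoff, and protection against running a task twice.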

What are you going to do? -647i(-); 2005-11-22 (#2622687@0)

Mining and search -integ(空谷清音); 2005-11-24 (#2626038@0)

I am doing some work on it. What's your scale? GB, 100 GB, or TB level? -647i(-); 2005-11-24 (#2626664@0)

An introduction to Lucene, the Java-based full-text indexing engine: http://www.chedong.com/tech/lucene.html -bjrenzx1(机器卡); 2005-11-23 (#2624578@0)

Weakness of Lucene -integ(空谷清音); 2005-11-24 {954} (#2625948@0)
It is said that Lucene supports Chinese full-text search much better than MySQL. As for other databases like Oracle, I have not done any research yet. -647i(-); 2005-11-24 (#2626674@0)

I am working on clustering algorithms for text categorization. -liyaobin(BigBen); 2005-11-24 (#2626209@0)

Another Vivisimo-like system? It's too late, I think. -benlin(默默向上游); 2005-11-25 (#2626978@0)

Hoho, so what is not too late in the search/mining field? Please? -liyaobin(BigBen); 2005-11-25 (#2627835@0)

No offence. -benlin(默默向上游); 2005-11-25 {433} (#2628567@0)
My company had a very good team and implemented a "Vivisimo-like search engine" in July 2004. The timing was pretty good, because Google had just gone public on Nasdaq. But we never got enough VC funding. So basically I don't see a bright future for such a website anymore.

I say the team was good because we had a Harvard professor, a professional manager in Silicon Valley, a technical manager from IBM China, and a senior programmer.

Thx for your info. My "hoho" actually means "xixi", nothing else. I am working on a small-scale automatic "Google News", mostly from an academic perspective. Any comments? Thx again for your gracious reply. -liyaobin(BigBen); 2005-11-26 (#2629326@0)

I'm also going to help a friend build one for fun; I'll be asking you all for help when I run into problems -rabbitbug(兔八哥); 2005-11-26 (#2628750@0)

I built Sohu's first-generation search engine (part of it). Hehe // -mondaycat(catt); 2005-11-26 (#2629329@0)

So? -liyaobin(BigBen); 2005-11-26 (#2629342@0)
