<!--<html>
<head>
<title>FAQ</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf8" />
</head>
<body text=#000000 bgcolor=#ffffff link=#000000 vlink=#000000 alink=#000000 >
<style>body,td,p,.h{font-family:arial,sans-serif; font-size: 15px;} </style>
<center>
<img border=0 width=500 height=122 src=/logo-med.jpg>
<br><br>
</center>
-->
<div style=max-width:700px;>
<br>
<h1>FAQ</h1>
Developer documentation is <a href=/developer.html>here</a>.
<br><br>
A work-in-progress <a href=/compare.html>comparison to SOLR</a>.
<br>
<br>
<h1>Table of Contents</h1>
<br>
<a href=#quickstart>Quick Start</a><br><br>
<a href=#src>Build from Source</a><br><br>
<a href=#features>Features</a><br><br>
<a href=/admin/api>API</a> - for doing searches, indexing documents and performing cluster maintenance<br><br>
<!--<a href=#weighting>Weighting Query Terms</a> - how to pass in your own query term weights<br><br>-->
<a href=#requirements>Hardware Requirements</a> - what is required to run gigablast
<br>
<br>
<a href=#perf>Performance Specifications</a> - various statistics.
<br>
<br>
<a href=#multisetup>Setting up a Cluster</a> - how to run multiple gb instances in a sharded cluster.
<br>
<br>
<a href=#scaling>Scaling the Cluster</a> - how to add more gb instances.
<br>
<br>
<a href=#scaling>Updating the Binary</a> - follows a procedure similar to <i>Scaling the Cluster</i>.
<br>
<br>
<a href=#trouble>Cleaning Up after a Crash</a> - how do I make sure my data is intact after a host crashes?
<br>
<br>
<a href=#spider>The Spider</a> - how does the spider work?
<br>
<br>
<!--<a href=#files>List of Files</a> - the necessary files to run Gigablast
<br>
<br>-->
<a href=#cmdline>Command Line Options</a> - various command line options (coming soon)
<br><br>
<!--
<a href=#clustermaint>Cluster Maintenance</a> - running Gigablast on a cluster of computers.<br><br><a href=#trouble>Troubleshooting</a> - how to fix problems
-->
<!--<br><br><a href=#disaster>Disaster Recovery</a> - dealing with a crashed host-->
<!--<br><br>
<a href=#security>The Security System</a> - how to control access-->
<!--<a href=#build>Building an Index</a> - how to start building your index<br><br>
<a href=#spider>The Spider</a> - all about Gigabot, Gigablast's crawling agent<br><br>-->
<!--<a href=#quotas>Document Quotas</a> - how to limit documents into the index<br><br>-->
<a href=/api.html#/admin/inject>Injecting Documents</a> - inserting documents directly into Gigablast
<br><br>
<a href=/api.html#/admin/inject>Deleting Documents</a> - removing documents from the index
<br><br><a href=#metas>Indexing User-Defined Meta Tags</a> - how Gigablast indexes user-defined meta tags
<!--<br><br><a href=#bigdocs>Indexing Big Documents</a> - what controls the maximum size of a document that can be indexed?-->
<!--<br><br><a href=#rolling>Rolling the New Index</a> - merging the realtime files into the base file-->
<br><br>
<a href=#dmoz>Building a DMOZ Based Directory</a> - build a web directory based on open DMOZ data
<br><br>
<a href=#logs>The Log System</a> - how Gigablast logs information
<br><br>
<a href=#optimizing>Optimizing</a> - optimizing Gigablast's spider and query performance
<!--
<br><br>
<a href=#config>gb.conf</a> - describes the gb configuration file
<br><br>
<a href=#hosts>hosts.conf</a> - the file that describes all participating hosts in the network
<br><br>
<a href=#stopwords>Stopwords</a> - list of common words generally ignored at query time<br><br>
<a href=#phrasebreaks>Phrase Breaks</a> - list of punctuation that breaks a phrase<br><br>
-->
<br><br><a name=quickstart></a>
<h1>Quick Start</h1>
&lt;<i>Last Updated February 2015</i>&gt;
<br>
<br>
<b><font color=red>Requirements:</font></b>
<br><br>
<!--Until I get the binary packages ready, <a href=#src>build from the source code</a>, it should only take about 30 seconds to type the three commands.-->
You will need an Intel or AMD system with at least 4GB of RAM for every gigablast shard you want to run.
<br><br>
<br>
<b><font color=red>For Debian/Ubuntu Linux:</font></b>
<br><br>
1. Download a package: <a href=http://www.gigablast.com/gb_1.19-1_amd64.deb>Debian/Ubuntu 64-bit</a> ( <a href=http://www.gigablast.com/gb_1.19-1_i386.deb>Debian/Ubuntu 32-bit</a> )
<br><br>
2. Install the package by entering: <b>sudo dpkg -i <i>&lt;filename&gt;</i></b> where filename is the file you just downloaded.
<br><br>
3. Type <b>sudo gb -d</b> to run Gigablast in the background as a daemon.
<br><br>
4. If running for the first time, it could take up to 20 seconds to build some preliminary files.
<br><br>
5. Once running, visit <a href=http://127.0.0.1:8000/>port 8000</a> with your browser to access the Gigablast controls.
<br><br>
6. To list all installed packages, run <b>dpkg -l</b>.
<br><br>
7. If you ever want to remove the gb package, type <b>sudo dpkg -r gb</b>.
<br><br>
<br>
<b><font color=red>For RedHat/Fedora Linux:</font></b>
<br><br>
1. Download a package: <a href=http://www.gigablast.com/gb-1.19-2.x86_64.rpm>RedHat 64-bit</a> ( <a href=http://www.gigablast.com/gb-1.19-2.i386.rpm>RedHat 32-bit</a> )
<br><br>
2. Install the package by entering: <b>rpm -i --force --nodeps <i>&lt;filename&gt;</i></b> where filename is the file you just downloaded.
<br><br>
3. Type <b>sudo gb -d</b> to run Gigablast in the background as a daemon.
<br><br>
4. If running for the first time, it could take up to 20 seconds to build some preliminary files.
<br><br>
5. Once running, visit <a href=http://127.0.0.1:8000/>port 8000</a> with your browser to access the Gigablast controls.
<br><br>
<br>
<b><font color=red>For Microsoft Windows:</font></b>
<br><br>
1. If you are running Microsoft Windows, then you will need to install Oracle's <a href=http://www.virtualbox.org/wiki/Downloads><b>VirtualBox for Windows hosts</b></a> software. That will allow you to run Linux in its own window on your Microsoft Windows desktop.
<br><br>
2. When configuring a new Linux virtual machine in VirtualBox, make sure you select at least 4GB of RAM.
<br><br>
3. Once VirtualBox is installed you can download either an
<!--<a href=http://virtualboxes.org/images/ubuntu/>Ubuntu</a> or <a href=http://virtualboxes.org/images/fedora/>RedHat Fedora</a>-->
<a href="http://www.ubuntu.com/download/desktop">Ubuntu CD-ROM Image (.iso file)</a> or a <a href="http://fedoraproject.org/get-fedora">Red Hat Fedora CD-ROM Image (.iso file)</a>.
The CD-ROM Images represent Linux installation CDs.
<br><br>
4. When you boot up Ubuntu or Fedora under VirtualBox for the first time, it will prompt you for the CD-ROM drive, and it will allow you to enter your .iso filename there.
<br><br>
5. Once you finish the Linux installation process
and then boot into Linux through VirtualBox, you can follow the Linux Quick Start instructions above.
<br><br>
<br>
<hr>
<br>
<table cellpadding=5><tr><td colspan=2><b>Installed Files</b></td></tr>
<tr><td><nobr>/var/gigablast/data0/</nobr></td><td>Directory of Gigablast binary and data files</td></tr>
<tr><td>/etc/init.d/gb</td><td>start up script link</td></tr>
<!--<tr><td>/etc/init/gb.conf</td><td>Ubuntu upstart conf file so you can type 'start gb' or 'stop gb', but that will only work on local instances of gb.</td></tr>-->
<tr><td>/usr/bin/gb</td><td>Link to /var/gigablast/data0/gb</td></tr>
</table>
<!--<br><br>
If you run into any bugs let me know so I can fix them right away: [email protected]-->
<br>
<br>
<a name=src></a>
<h1>Build From Source</h1>
&lt;<i>Last Updated January 2015</i>&gt;
<br>
<br>
Requirements: You will need an Intel or AMD system running Linux and at least 4GB of RAM.<br><br>
<!--If you run into an bugs let me know so i can fix them right away: [email protected].
<br><br>-->
<!--
You will need the following packages installed<br>
<ul>
<li>For Ubuntu do a <b>apt-get install make g++ gcc-multilib</b>
<br>
For RedHat do a <b>yum install gcc-c++ glibc-static libstdc++-static openssl-static</b>
-->
<!--<li>apt-get install g++
<li>apt-get install gcc-multilib <i>(for 32-bit compilation support)</i>
-->
<!--<li>apt-get install libssl-dev <i>(for the includes, 32-bit libs are here)</i>-->
<!--<li>apt-get install libplot-dev <i>(for the includes, 32-bit libs are here)</i>-->
<!--<li>apt-get install lib32stdc++6-->
<!--<li>apt-get install ia32-libs-->
<!--<li>I supply libstdc++.a but you might need the include headers and have to do <b>apt-get install lib32stdc++6</b> or something.
</ul>
-->
<b>1.0</b> For <u>Ubuntu 12.04 or 14.04</u>: do <b>sudo apt-get update ; sudo apt-get install make g++ libssl-dev binutils</b>
<br><br>
<!--<b>1.1</b> For <u>64-bit Ubuntu 12.02</u>: do <b>sudo apt-get update ; apt-get install make g++ libssl-dev</b>
<br><br>
<<b>1.1</b> For <u>32-bit Ubuntu 14.04</u>: do <b>sudo apt-get update ; apt-get install make g++ gcc-multilib g++-multilib</b>
<br><br>
<b>1.3</b> For <u>32-bit Ubuntu 12.02</u>: do <b>sudo apt-get update ; apt-get install make g++ gcc-multilib </b>
<br><br>
-->
<b>1.1</b> For <u>RedHat</u>: do <b>sudo yum install gcc-c++</b>
<br><br>
<b>2.</b> Download the <a href=https://github.com/gigablast/open-source-search-engine>Gigablast source code</a> using <b>wget --no-check-certificate "https://github.com/gigablast/open-source-search-engine/archive/master.zip"</b>, unzip it and cd into it. (optionally use <b>git clone https://github.com/gigablast/open-source-search-engine.git ./github</b> if you have <i>git</i> installed.)
<br><br>
<b>3.0</b> Run <b>make</b> to compile. (e.g. use 'make -j 4' to compile on four cores)
<br><br>
<b>3.1</b> If you want to compile a 32-bit version of gb for some reason,
run <b>make clean ; make gb32</b>.
<br><br>
<b>4.</b> Run <b>./gb -d</b> to start a single Gigablast node in daemon mode (-d), listening on port 8000.
<br><br>
<b>5.</b> The first time you run gb, wait about 30 seconds for it to build some files. Check the log file to see when it completes.
<br><br>
<b>6.</b> Go to the <a href=http://127.0.0.1:8000/>root page</a> to begin.
<br>
<br><br><a name=features></a>
<h1>Features</h1>
&lt;<i>Last Updated Jan 2015</i>&gt;
<br>
<ul>
<li> <b>The ONLY open source WEB search engine.</b>
<li> 64-bit architecture.
<li> Scalable to thousands of servers.
<li> Has scaled to over 12 billion web pages on over 200 servers.
<li> A dual quad core with 32GB of RAM and two 160GB Intel SSDs, running 8 Gigablast instances, can do about 8 qps (queries per second) on an index of 10 million pages. Drives will be close to maximum storage capacity. Doubling the index size will more or less halve the qps rate. (Performance metrics can be made about ten times faster but I have not got around to it yet. Drive space usage will probably remain about the same because it is already pretty efficient.)
<li> 1 million web pages require 28.6GB of drive space. That includes the index, meta information and the compressed HTML of all the web pages. That is 28.6KB of disk per HTML web page.
<li>Spider rate is around 1 page per second per core. So a dual quad core can spider and index 8 pages per second which is 691,200 pages per day.
<li> 4GB of RAM required per Gigablast instance. (instance = process)
<li> Live demo at <a href=http://www.gigablast.com/>http://www.gigablast.com/</a>
<li> Written in C/C++ for optimal performance.
<li> Over 500,000 lines of C/C++.
<li> 100% custom. A single binary. The web server, database and everything else
is all contained in this source code in a highly efficient manner. Makes administration and troubleshooting easier.
<li> Reliable. Has been tested in live production since 2002 on billions of
queries on an index of over 12 billion unique web pages, 24 billion mirrored.
<li> Super fast and efficient. One of a small handful of search engines that have hit such big numbers. The only open source search engine that has.
<li> Supports all languages. Can give results in specified languages a boost over others at query time. Uses UTF-8 representation internally.
<li> Track record. Has been used by many clients. Has been successfully used
in distributed enterprise software.
<li> Cached web pages with query term highlighting.
<li> Shows popular topics of search results (Gigabits), like a faceted search on all the possible phrases.
<li> Email alert monitoring. Lets you know when the system is down in all or part, or if a server is overheating, or a drive has failed, or a server is consistently going out of memory, etc.
<li> "Synonyms" based on wiktionary data. Using query expansion method.
<li> Customizable "synonym" file: my-synonyms.txt
<li> No silly TF/IDF or Cosine. Stores position and format information (fancy bits) of each word in an indexed document. It uses this to return results that contain the query terms in close proximity rather than relying on the probabilistic tf/idf approach of other search engines. The older version of Gigablast used tf/idf on Indexdb, whereas it now uses Posdb to hold the index data.
<li> Complete scoring details are displayed in the search results.
<li> Indexes anchor text of inlinks to a web page and uses many techniques to flag pages as link spam thereby discounting their link weights.
<li> Demotes web pages if they are spammy.
<li> Can cluster results from same site.
<li> Duplicate removal from search results.
<li> Distributed web crawler/spider. Supports crawl delay and robots.txt.
<li> Crawler/Spider is highly programmable and URLs are binned into priority queues. Each priority queue has several throttles and knobs.
<li> Spider status monitor to see the URLs being spidered over the whole cluster in a real-time widget.
<li> Complete REST/XML API for doing queries as well as adding and deleting documents in real-time.
<li> Automated data corruption detection, fail-over and repair based on hardware failures.
<li> Custom Search. (aka Custom Topic Search). Using a cgi parm like &sites=abc.com+xyz.com you can restrict the search results to a list of up to 500 subdomains.
<li> DMOZ integration. Run DMOZ directory. Index and search over the pages in DMOZ. Tag all pages from all sites in DMOZ for searching and displaying of DMOZ topics under each search result.
<li> Collections. Build tens of thousands of different collections, each treated as a separate search engine. Each can spider and be searched independently.
<li> Federated search over multiple Gigablast collections using syntax like &c=mycoll1+mycoll2+mycoll3+...
<li> Plug-ins. For indexing any file format by calling Plug-ins to convert that format to HTML. Provided binary plug-ins: pdftohtml (PDF), ppthtml (PowerPoint), antiword (MS Word), pstotext (PostScript).
<li> Indexes JSON and XML natively. Provides ability to search individual structured fields.
<li> Sorting. Sort the search results by meta tags or JSON fields that contain numbers, simply by adding something like gbsortby:price or gbrevsortby:price as a query term, assuming you have meta price tags.
<li> Easy Scaling. Add new servers to the <a href=/hosts.conf.txt>hosts.conf</a> file then click 'rebalance shards' to automatically rebalance the sharded data.
<li> Using &stream=1 can stream back millions of search results for a query without running out of memory.
<li> Makes and displays thumbnail images in the search results.
<li> Nested boolean queries using AND, OR, NOT operators.
<li> Built-in support for <a href=http://www.diffbot.com/products/automatic/>diffbot.com's api</a>, which extracts various entities from web sites, like products, articles, etc. But you will need to get a free token from them for access to their API.
<li> Facets over meta tags or X-Paths for HTML documents.
<li> Facets over JSON and XML fields.
<li> Sort and constrain by numeric fields in JSON or XML.
<li> Built-in real-time profiler.
<li> Built-in QA tester.
<li> Can inject WARC and ARC archive files.
<li> Spellchecker will be re-enabled shortly.
</ul>
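<br>
The spider-rate and disk-space figures in the list above can be turned into a rough capacity estimate. The following is a minimal shell sketch, assuming the quoted rates (about 1 page per second per core, 28.6GB per million pages) hold for your hardware and content. The function names are hypothetical helpers and are not part of gb.
<br><br>
<pre>
```shell
#!/bin/sh
# Rough capacity estimates from the figures quoted in the feature list.
# These helpers are illustrative only; they are not gb commands.

# Pages spidered per day, assuming about 1 page per second per core.
pages_per_day() {
    cores="$1"
    echo $(( cores * 86400 ))   # 86400 seconds in a day
}

# Disk space in GB, assuming 28.6GB per 1 million pages.
disk_gb() {
    awk -v pages="$1" 'BEGIN { printf "%.1f\n", pages * 28.6 / 1000000 }'
}

pages_per_day 8     # dual quad core = 8 cores
disk_gb 10000000    # the 10 million page index mentioned above
```
</pre>
For 8 cores this gives 691,200 pages per day, matching the figure above. For 10 million pages it gives 286.0GB, which is consistent with two 160GB drives being close to full.
<br><br>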
<h2>Coming Soon</h2>
<ul>
<li> file:// "spidering" support
<li> smb:// "spidering" support
<li> Query completion
<li> Improved plug-in support
</ul>
<br>
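The Custom Search and federated search features listed above both take a plus-separated list in a cgi parameter. As a small sketch, the shell helpers below build those two parameter strings. Only the parameter syntax (sites, c, and the 500-subdomain limit) comes from the feature list; the helper names are hypothetical, and how the parameters are attached to a search URL is left to the <a href=/admin/api>API</a> documentation.
<br><br>
<pre>
```shell
#!/bin/sh
# Build the cgi parameters described in the feature list.
# Helper names are hypothetical; only the parameter syntax comes from above.

# Join all arguments with '+', so "a b c" becomes "a+b+c".
join_plus() {
    out=""
    for item in "$@"; do
        if [ -z "$out" ]; then out="$item"; else out="$out+$item"; fi
    done
    echo "$out"
}

# Custom search: restrict results to a list of up to 500 subdomains.
sites_param() {
    [ "$#" -le 500 ] || return 1   # refuse lists over the documented limit
    echo "&sites=$(join_plus "$@")"
}

# Federated search over several collections.
coll_param() {
    echo "&c=$(join_plus "$@")"
}

sites_param abc.com xyz.com
coll_param mycoll1 mycoll2 mycoll3
```
</pre>
<br>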
<!--
<br><br><a name=weighting></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Weighting Query Terms</td></tr></table>
<br><br>
Gigablast allows you to pass in weights for each term in the provided query. The query term weight operator, which is directly inserted into the query, takes the form: <b>[XY]</b>, where <i>X</i> is the weight you want to apply and <i>Y</i> is <b><i>a</i></b> if you want to make it an absolute weight or <b><i>r</i></b> for a relative weight. Absolute weights cancel any weights that Gigablast may place on the query term, like weights due to the term's popularity, for instance. The relative weight, on the other hand, is multiplied by any weight Gigablast may have already assigned.<br><br>
The query term weight operator will affect all query terms that follow it. To turn off the effects of the operator just use the blank operator, <b>[]</b>. Any weight operators you apply override any previous weight operators.<br><br>
The weight applied to a phrase is unaffected by the weights applied to its constituent terms. In order to weight a phrase you must use the <b>[XYp]</b> operator. To turn off the affects of a phrase weight operator, use the phrase blank operator, <b>[p]</b>.<br><br>
Applying a relative weight of 0 to a query term, like <b>[0r]</b>, has the effect of still requiring the term in the search results (if it was not ignored), but not allowing it to contribute to the ranking of the search results. However, when doing a default OR search, if a document contains two such terms, it will rank above a document that only contains one such term. <br><br>
Applying an absolute weight of 0 to a query term, like <b>[0a]</b>, causes it to be completely ignored and not used for generating the search results at all. But such ignored or devalued query terms may still be considered in a phrase context. To affect the phrases in a similar manner, use the phrase operators, <b>[0rp]</b> and <b>[0ap]</b>.<br><br>
Example queries:<br><br>
<b>[10r]happy [5rp][13r]day []lucky</b><br>
<i>happy</i> is weighted 10 times it's normal weight.<br>
<i>day</i> is weighted 13 times it's normal weight.<br>
<i>"day lucky"</i>, the phrase, is weighted 5 times it's normal weight.<br>
<i>lucky</i> is given it's normal weight assigned by Gigablast.<br><br>
Also, keep in mind not to use these weighting operators between another query operator, like '+', and its affecting query term. If you do, the '+' or '-' operator will not work.<br><br>
-->
<a name=requirements></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Hardware Requirements</td></tr></table>
<br>
&lt;<i>Last Updated January 2014</i>&gt;
<br>
<br>
At least one computer with 4GB RAM, 10GB of hard drive space and any distribution of Linux with the 2.4.25 kernel or higher. For decent performance invest in Intel Solid State Drives. I tested other brands around 2010 and found that they would freeze for up to 500ms every hour or so to do "garbage collection". That is unacceptable in general for a search engine.
Plus, Gigablast reads and writes a lot of data at the same time under heavy spider and query loads, so disk will probably be your MAJOR bottleneck.<br><br>
<br>
<a name=perf></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Performance Specifications</td></tr></table>
<br>
&lt;<i>Last Updated January 2014</i>&gt;
<br>
<br>
Gigablast can store 100,000 web pages (each around 25KB in size) per gigabyte of disk storage. A typical single-CPU Pentium 4 machine can index one to two million web pages per day even when Gigablast is near its maximum document capacity for the hardware. A cluster of N such machines can index at N times that rate.<br><br>
<br>
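The figures in this section lend themselves to a quick back-of-the-envelope estimate. The sketch below assumes the numbers quoted here (100,000 pages per gigabyte, a fixed pages-per-day rate per machine, linear scaling across machines); the helper names are hypothetical. Note that the Features section quotes a different storage figure (28.6GB per million pages), so treat either result as a rough guide only.
<br><br>
<pre>
```shell
#!/bin/sh
# Rough estimates from the figures in this section.
# Helper names are hypothetical; they are not gb commands.

# Disk space in GB at 100,000 pages per gigabyte, rounded up.
disk_gb_needed() {
    pages="$1"
    echo $(( (pages + 99999) / 100000 ))
}

# Days to index a corpus on N machines, each indexing per_day pages a day.
# Assumes the cluster indexes at N times the single-machine rate.
days_to_index() {
    pages="$1"; machines="$2"; per_day="$3"
    total=$(( machines * per_day ))
    echo $(( (pages + total - 1) / total ))   # round up to whole days
}

disk_gb_needed 50000000             # 50 million pages
days_to_index 50000000 10 1000000   # 10 machines at 1 million pages per day
```
</pre>
With these inputs the sketch reports 500GB of disk and 5 days of indexing.
<br><br>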
<!--
<a name=files></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>List of Files</td></tr></table>
<br>
<b>1.</b> Create one directory for every Gigablast process you would like to run. Each Gigablast process is also called a <i>host</i> or a <i>node</i>. Multiple processes can exist on one physical server and is usually done to take advantage of mutiple cores, one process per core.<br><br>
<b>2.</b>
Each directory should have the following files and subdirectories:<br><br>
<table cellpadding=3>
<tr><td><b>gb</b></td><td>The Gigablast executable. Contains the web server, the database and the spider. This file is required to run gb. It will be created when gb first runs.</td></tr>
<tr><td><b>hosts.conf</b></td><td><a href=/hosts.conf>example</a>. This file describes each host (gb process) in the Gigablast network. Every gb process uses the same hosts.conf file. This file is required to run gb.
</td></tr>
<tr><td><b>gb.conf</b></td><td><a href=/gb.conf.txt>example</a>. Each gb process is called a <i>host</i> and each gb process has its own gb.conf file. This file is required to run gb.<tr><td><b>coll.XXX.YYY/</b></td><td>For every collection there is a subdirectory of this form, where XXX is the name of the collection and YYY is the collection's unique id. Contained in each of these subdirectories is the data associated with that collection.</td></tr>
-->
<!--<tr><td><b>coll.XXX.YYY/coll.conf</b></td><td>Each collection contains a configuration file called coll.conf. This file allows you to configure collection specific parameters. Every parameter in this file is also controllable via your the administrative web pages as well.</td></tr>
-->
<!--
<tr><td><b>trash/</b></td><td>Deleted collections are moved into this subdirectory. A timestamp in milliseconds since the epoch is appended to the name of the deleted collection's subdirectory after it is moved into the trash sub directory. Gigablast doesn't physically delete collections in case it was a mistake.</td></tr>
<tr><td><b>html/</b></td><td>A subdirectory that holds all the html files and images used by Gigablast. Includes Logos and help files.</tr>
<tr><td><b>antiword</b></td><td>Executable called by gbfilter to convert Microsoft Word files to html for indexing.</tr>
<tr><td><b>antiword-dir/</b></td><td>A subdirectory that contains information needed by antiword.</tr>
<tr><td><b>pdftohtml</b></td><td>Executable called by gbfilter to convert PDF files to html for indexing.</tr>
<tr><td><b>pstotext</b></td><td>Executable called by gbfilter to convert PostScript files to text for indexing.</tr>
<tr><td><b>ppthtml</b></td><td>Executable called by gbfilter to convert PowerPoint files to html for indexing.</tr>
<tr><td><b>xlhtml</b></td><td>Executable called by gbfilter to convert Microsoft Excel files to html for indexing.</tr>
</table>
-->
<!--<tr><td><b>gbfilter</b></td><td>Simple executable called by Gigablast with document HTTP MIME header and document content as input. Output is an HTTP MIME and html or text that can be indexed by Gigablast.</tr>-->
<!--<tr><td><b><a href=#gbstart>gbstart</a></b></td><td>An optional simple script used to start up the gb process(es) on each computer in the network. Otherwise, iff you have passwordless ssh capability then you can just use './gb start' and it will spawn an ssh command to start up a gb process for each host listed in hosts.conf.</tr>-->
<!--
</table>
<br><br>
<br>
-->
<a name=multisetup></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Setting up a Cluster</td></tr></table>
<br>
&lt;<i>Last Updated July 2014</i>&gt;
<br>
<br>
1. Locate the <a href=/hosts.conf.txt>hosts.conf</a> file. If installing from binaries it should be in the /var/gigablast/data0/ directory. If it does not exist yet then run <b>gb</b> or <b>./gb</b> which will create one. You will then have to exit gb after it does.
<br><br>
2. Update the <b>num-mirrors</b> in the <a href=/hosts.conf.txt>hosts.conf</a> file. Leave it as 0 if you do not want redundancy. If you want each shard to be mirrored by one other gb instance, then set this to 1. I find that 1 is typically good enough, provided that the mirror is on a different physical server. So if one server gets trashed there is another to serve that shard. The sole advantage in not mirroring your cluster is that you will have twice the disk space for storing documents. Query speed should be unaffected because Gigablast is smart enough to split the load evenly between mirrors when processing queries. You can send your queries to any shard and it will communicate with all the other shards to aggregate the results. If one shard fails and you are not mirroring then you will lose that part of the index, unfortunately.
<br><br>
3. Make one entry in the <a href=/hosts.conf.txt>hosts.conf</a> per physical core you have on your server. If an entry is on the same server as another, then it will need a completely different set of ports. Each gb instance also requires 4GB of RAM, so you may be limited by your RAM before being limited by your cores. You can of course run multiple gb instances on a single core if you have the RAM, but performance will not be optimal.
<br><br>
4. Continue following the instructions for <a href=#scaling>Scaling the Cluster</a> below in order to get the other shards set up and running.
<br>
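The sizing rules in steps 2 and 3 above can be sketched as a small calculation: one instance per physical core, 4GB of RAM per instance, and each shard served by its primary plus its mirrors. The helper names are hypothetical, and the generalisation to more than one mirror per shard is an assumption; the text above only describes num-mirrors values of 0 and 1.
<br><br>
<pre>
```shell
#!/bin/sh
# How many gb instances fit on one server, and how many shards result.
# Helper names are hypothetical; the rules come from steps 2 and 3 above.

# One instance per physical core, 4GB of RAM per instance: take the smaller.
max_instances() {
    cores="$1"; ram_gb="$2"
    by_ram=$(( ram_gb / 4 ))
    if [ "$by_ram" -lt "$cores" ]; then echo "$by_ram"; else echo "$cores"; fi
}

# Assumption: each shard is served by (num-mirrors + 1) instances.
shard_count() {
    hosts="$1"; num_mirrors="$2"
    echo $(( hosts / (num_mirrors + 1) ))
}

max_instances 8 32   # dual quad core with 32GB of RAM
max_instances 8 16   # same cores, but RAM is the limit
shard_count 8 1      # 8 instances, each shard mirrored once
```
</pre>
This prints 8, 4 and 4: the 32GB server can run one instance per core, the 16GB server is limited by RAM, and 8 mirrored instances hold 4 distinct shards.
<br><br>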
<br>
<br>
<a name=scaling></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Scaling the Cluster</td></tr></table>
<br>
&lt;<i>Last Updated June 2014</i>&gt;
<br>
<br>
1. If your spiders are active, then turn off spidering in the <a href=/admin/master>master controls</a>.
<br><br>
2. If your cluster is running, shut it down by running <b>gb stop</b> on the command line OR by clicking on "save &amp; exit" in the <a href=/admin/master>master controls</a>.
<br><br>
3. Edit the <a href=/hosts.conf.txt>hosts.conf</a> file in the working directory of host #0 (the first host entry in the hosts.conf file) to add the new hosts.
<br><br>
4. Ensure you can do passwordless ssh from host #0 to each new IP address you added. This generally requires running <b>ssh-keygen -t dsa</b> on host #0 to create the files <i>~/.ssh/id_dsa</i> and <i>~/.ssh/id_dsa.pub</i>. Then you need to insert the key in <i>~/.ssh/id_dsa.pub</i> into the <i>~/.ssh/authorized_keys2</i> file on every host, including host #0, in your cluster. Furthermore, you must do a <b>chmod 700 ~/.ssh/authorized_keys2</b> on each one otherwise the passwordless ssh will not work.
<br><br>
5. Run <b>gb install &lt;hostid&gt;</b> on host #0 for each new hostid to copy the required files from host #0 to the new hosts. This will do an <i>scp</i>, which requires the passwordless ssh. &lt;hostid&gt; can also be a range of hostids like <i>5-12</i>.
<br><br>
6. Run <b>gb start</b> on the command line to start up all gb instances/processes in the cluster.
<br><br>
7. If your index was not empty, then click on <b>rebalance shards</b> in the <a href=/admin/master>master controls</a> to begin moving data from the old shards to the new shards. The <a href=/admin/hosts>hosts table</a> will let you know when the rebalance operation is complete. The cluster should be able to serve queries during the rebalancing, but spidering cannot resume until it is completed.
<br>
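The command-line part of the procedure above (steps 2, 5 and 6) can be sketched as a short script run on host #0. This is an illustration only: it uses just the gb commands named in the steps, defaults to a dry run that prints the commands instead of executing them, and the NEW_HOSTS and DRY_RUN variable names are hypothetical. Turning off spidering, editing hosts.conf, setting up ssh keys and clicking rebalance shards still have to be done by hand.
<br><br>
<pre>
```shell
#!/bin/sh
# Sketch of steps 2, 5 and 6 above, to be run on host #0.
# NEW_HOSTS and DRY_RUN are hypothetical names used only in this example.
NEW_HOSTS="${NEW_HOSTS:-5-12}"   # hostid or range, as in the step 5 example
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

scale_cluster() {
    run gb stop                    # step 2: shut the cluster down
    # Steps 3 and 4 (hosts.conf edits, passwordless ssh) are manual.
    # In a real run, confirm every host has stopped before continuing.
    run gb install "$NEW_HOSTS"    # step 5: copy files to the new hosts
    run gb start                   # step 6: start every instance
}

scale_cluster
```
</pre>
Run as is, the script only prints the three gb commands it would execute, which makes it easy to review the hostid range before setting DRY_RUN=0.
<br><br>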
<br>
<br>
<!--
<a name=clustermaint></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Cluster Maintenance</td></tr></table>
<br>
<<i>Caution: Old Documentation</i>>
<br>
<br>
For the purposes of this section, we assume the name of the cluster is gf and all hosts in the cluster are named gf*. The Master host of the cluster is gf0. The gigablast working directory is assumed to be /a/ . We all assume you can do passwordless ssh from one machine to another, otherwise administration of hundreds of servers is not fun!
<br>
<br>
<b>To setup dsh:</b>
<ul>
<li> Install the dsh package, on debian it would be:<br> <b> $ apt-get install dsh</b><br>
<li>Go to the working directory in your bash shell and type <b>./gb dsh hostname | sort | uniq > all</b> to add the hostname of each server to the file <i>all</i>.
<br></ul>
<b>To setup dsh on a machine on which we do not have root:</b>
<ul>
<li>cd to the working directory
<li>Copy /usr/lib/libdshconfig.so.1.0.0 to the working directory.
<li><b>export LD_LIBRARY_PATH=.</b>
</ul>
<b>To use the dsh command:</b>
<ul>
<li>run <b>dsh -c -f all hostname</b> as a test. It should execute the hostname command on all servers listed in the file <i>all</i>.
<li>to copy a master configuration file to all hosts:<br>
<b>$ dsh -c -f all 'scp gf0:/a/coll.conf /a/coll.conf'</b><br>
<li>to check running processes on all machines concurrently (-c option):<br>
<b>$ dsh -c -f all 'ps auxww'</b><br>
</ul>
<b>To prepare a new cluster or erase an old cluster:</b><ul>
<li>Save <b>/a/gb.conf</b>, <b>/a/hosts.conf</b>, and <b>/a/coll.*.*/coll.conf</b> files somewhere besides on /dev/md0 if they exist and you want to keep them.
<li>cd to a directory not on /dev/md0
<li>Login as root using <b>su</b>
<li>Use <b>dsh -c -f all 'umount /dev/md0'</b> to unmount the working directory. All login shells must exit or cd to a different directory, and all processes with files opened in /dev/md0 must exit for the unmount to work.
<li>Use <b>dsh -c -f all 'mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0'</b> to rebuild the filesystem on the raid. CAUTION!!! WARNING!! THIS COMPLETELY ERASES ALL DATA ON /dev/md0
<li>Use <b>dsh -c -f all 'mount /dev/md0'</b> to remount it.
<li>Use <b>dsh -c -f all 'mkdir /mnt/raid/a ; chown mwells:mwells /mnt/raid/a'</b> to create the 'a' directory and let user mwells, or another search engine administrator username, own it.
<li>Recopy over the necessary gb files to every machine.
</ul>
<br>
<b>To test a new gigablast executable:</b><ul>
<li>Change to the gigablast working directory.<br> <b>$ cd /a</b><li>Stop all gb processes on hosts.conf.<br> <b>$ gb stop</b><li>Wait until all hosts have stopped and saved their data. (the following line should not print anything)<br> <b>$ dsh -a 'ps auxww' | grep gb</b>
<li>Copy the new executable onto gf0<br> <b>$ scp gb user@gf0:/a/</b><li>Install the executable on all machines.<br> <b>$ gb installgb</b><br><li>This will copy the gb executable to all hosts. You must wait until all of the scp processes have completed before starting the gb process. Run ps to verify that all of the scp processes have finished.<br> <b>$ ps auxww</b><li>Run gb start<br> <b>$ gb start </b><li>As soon as all of the hosts have started, you can use the web interface to gigablast.<br></ul>
<b>To switch the live cluster from the current (cluster1) to another (cluster2):</b><ul>
<li>Ensure that the gb.conf of cluster2 matches that of cluster1, excluding any desired changes.<br><li>Ensure that the coll.conf for each collection on cluster2 matches that of cluster1, excluding any desired changes.<br><li>Thoroughly test cluster2 using the blaster program.<br><li>Test duplicate queries between cluster1 and cluster2 and ensure results properly match, with the exception of any known new changes.<br><li>Make sure port 80 on cluster2 is directing to the correct port for gb.<br> <b>$ iptables -t nat -A PREROUTING -i eth0 -p tcp -m tcp --dport 80 -j DNAT --to-destination 2.2.2.2:8000</b><br><li>Test that cluster2 works correctly by accessing it from a browser using only its IP in the address.<br><li>For both primary and secondary DNS servers, perform the following:<br><ul><li>Edit /etc/bind/db.<hostname> (e.g. db.gigablast.com)<br> <b>$ vi /etc/bind/db.gigablast.com</b><br> <li>Change lines using cluster1's ip to have cluster2's ip. It is recommended that you comment out the old line with a ; at the front.<br> <b>e.g.: "www IN A 1.1.1.1" >> "www IN A 2.2.2.2"</b><br> <li>Edit /etc/bind/db.64<br> <b>$ vi /etc/bind/db.64</b><br> <li>Change lines with cluster1's last IP number to have cluster2's last IP number.<br> <b>e.g.: "1 IN PTR www.gigablast.com" >> "2 IN PTR www.gigablast.com"</b><br> <li>Restart named.<br> <b>$ /etc/rc3.d/S15bind9 restart</b><br></ul><li>Again, test that cluster2 works correctly by accessing it from a browser using only its IP in the address.<br><li>Check log0 of cluster2 to make sure it is receiving queries.<br> <b>$ tail -f /a/log0</b><br><li>Allow cluster1 to remain active until all users have switched over to cluster2.<br></ul><br>
-->
<a name=trouble></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Cleaning Up After a Crash</td></tr></table>
<br>
<<i>Last Updated Sep 2014</i>>
<br>
<!--
<br>
<a name=disaster></a>
<b>A host in the network crashed. How do I temporarily decrease query latency on the network until I get it up again?</b><br>You can go to the <i>Search Controls</i> page and cut all nine tier sizes in half. This will reduce search result recall, but should cut query latency times in half for slower queries until the crashed host is recovered.<br>-->
<br><b>A host in the network crashed. What is the recovery procedure?</b><br>First determine whether the host's crash was clean or unclean. It was clean if the host was able to save all data in memory before it crashed. If the log ends with <i>allExit: dumping core after saving</i> then the crash was clean; otherwise it was not.<br><br>If the crash was clean, you can simply restart the crashed host by typing <b>gb start <i>i</i></b> where <i>i</i> is the hostId of the crashed host. However, if the crash was not clean, as in the case of a sudden power outage, then to ensure no data gets lost you must copy the data from the crashed host's twin. If it does not have a twin then there may be some data loss and/or corruption. In that case try reading the section below, <i>How do I minimize the damage after an unclean crash with no twin?</i>, but you may be better off starting the index build from scratch. To recover from an unclean crash using the twin, follow the steps below: <br><br>a. Click on 'all spiders off' in the 'master controls' of host #0, or host #1 if host #0 was the host that crashed.<br>b. If you were injecting content directly into Gigablast, stop.<br>c. Click on 'all just save' in the 'master controls' of host #0, or host #1 if host #0 was the one that crashed.<br>d. Determine the twin of the crashed host by looking in the <a href=/hosts.conf.txt>hosts.conf</a> file or on the <a href=/admin/hosts>hosts</a> page. The twin will have the same shard number as the crashed host.<br>e. Recursively copy the working directory of the twin to the crashed host using rcp since it is much faster than scp.<br>f. Restart the crashed host by typing <b>gb start <i>i</i></b> where <i>i</i> is the hostId of the crashed host. If it is not restartable, then skip this step.
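<br><br>
The clean-versus-unclean test can be scripted. The following is a minimal Python sketch and is not part of Gigablast: it assumes the marker <i>allExit: dumping core after saving</i> appears within the last few lines of the crashed host's log file, and the function name <i>crash_was_clean</i> and the 20-line window are illustrative choices.
<pre>
```python
from pathlib import Path

# Marker the log ends with after a clean crash, as described above.
CLEAN_MARKER = "allExit: dumping core after saving"


def crash_was_clean(log_path, tail_lines=20):
    """Return True if the end of the log contains the clean-crash marker."""
    # Reading the whole file keeps the sketch short; for very large logs
    # you would want to read only the end of the file.
    text = Path(log_path).read_text(errors="replace")
    tail = text.splitlines()[-tail_lines:]
    return any(CLEAN_MARKER in line for line in tail)
```
</pre>
If <i>crash_was_clean</i> returns True for the crashed host's log, a plain <b>gb start <i>i</i></b> should be enough; otherwise follow the twin-based steps above.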
<!--
<br>g. If the crashed host was restarted, wait for it to come back up. Monitor another host's <i>hosts</i> table to see when it is up, or watch the log of the crashed host.<br>h. If the crashed host was restarted, wait a minute for it to absorb all of the data add requests that may still be lingering. Wait for all hosts' <i>spider queues</i> of urls currently being spidered to be empty of urls.<br>i. Perform another <i>all just save</i> command to relegate any new data to disk.<br>j. After the copy completes edit the hosts.conf on host #0 and replace the ip address of the crashed host with that of the spare host.<br>k. Do a <b>gb stop</b> to safely shut down all hosts in the network.<br>l. Do a <b>gb installconf</b> to propagate the hosts.conf file from host #0 to all other hosts in the network (including the spare host, but not the crashed host)<br>m. Do a <b>gb start</b> to bring up all hosts under the new hosts.conf file.<br>n. Monitor all logs for a little bit by doing <i>dsh -c -f all 'tail -f /a/log? /a/log??'</i><br>o. Check the <i>hosts</i> table to ensure all hosts are up and running.
-->
<br><br><br><b>How do I minimize the damage after an unclean crash with no twin?</b><br>You may never be able to get the index 100% back into shape right now, but in the near future there may be some technology that allows gigablast to easily recover from these situations. For now, though, try to determine the last url that was indexed and *fully* saved to disk. Every time you index a url some data is added to all of these databases: checksumdb, posdb (index), spiderdb, titledb and clusterdb. These databases all have in-memory data that is periodically dumped to disk. So you must determine the last time each of these databases dumped to disk by looking at the timestamp on the corresponding files in the appropriate collection subdirectories contained in the working directory. Use the oldest of those timestamps: for example, if clusterdb was dumped to disk the longest time ago, then its timestamp indicates when the last url was successfully added or injected. You might want to subtract thirty minutes from that timestamp to be safe, because what you are really after is the time that file <b>started</b> being dumped to disk, whereas the timestamp records the time of the last write to that file. Now you can re-add the potentially missing urls from that time forward using the <a href=/admin/addurl>AddUrl page</a> and get a semi-decent recovery.
<br>
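<br>
The timestamp reasoning above can be sketched in Python. This is an illustration only, not a Gigablast tool: it assumes that each database's files live directly in the collection subdirectory and that their names begin with the database name, and the function names <i>last_dump_time</i> and <i>safe_readd_cutoff</i> are hypothetical. It takes the most recent write for each database, picks the oldest of those, and subtracts the thirty minute margin.
<pre>
```python
from datetime import datetime, timedelta
from pathlib import Path

# Databases that receive data whenever a url is indexed (from the text above).
DATABASES = ["checksumdb", "posdb", "spiderdb", "titledb", "clusterdb"]


def last_dump_time(coll_dir, db):
    """Newest modification time among files whose names start with db, or None."""
    # Assumption: a database's files are named with the database name as prefix.
    times = [p.stat().st_mtime for p in Path(coll_dir).iterdir()
             if p.is_file() and p.name.startswith(db)]
    return max(times) if times else None


def safe_readd_cutoff(coll_dir, margin_minutes=30):
    """Oldest per-database dump time, minus a safety margin."""
    dumps = [last_dump_time(coll_dir, db) for db in DATABASES]
    dumps = [t for t in dumps if t is not None]
    if not dumps:
        raise ValueError("no database files found in " + str(coll_dir))
    # The database that dumped longest ago bounds what was fully saved.
    return datetime.fromtimestamp(min(dumps)) - timedelta(minutes=margin_minutes)
```
</pre>
Urls added or injected after the returned time are the ones to re-add. A database with no files at all is skipped by this sketch, so check for that case by hand.
<br>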
<!--
<br><br><b>I get different results for the XML feed (raw=X) as compared to the HTML feed. What is going on?</b><br> Try adding the &rt=1 cgi parameter to the search string to tell Gigablast to return real time results. <i>rt</i> is set to 0 by default for the XML feed, but not for the HTML feed. That means Gigablast will only look at the root indexdb file when looking up queries. Any newly added pages will be indexed outside of the root file until a merge is done. This is done for performance reasons. You can enable real time lookups by adding &rt=1 to the search string. Also, in your search controls there are options to enable or disable real time lookups for regular queries and XML feeds, labeled as "restrict indexdb for queries" and "restrict indexdb for xml feed". Make sure both regular queries and xml queries are doing the same thing when comparing results.<br><br>Also, you need to look at the tier sizes at the top of the Search Controls page. The tier sizes (tierStage0, tierStage1, ...) listed for the raw (XML feed) queries need to match the non-raw tier sizes in order to get exactly the same results. Smaller tier sizes yield better performance but fewer search results.
-->
<!--
<br><br><b>The spider is on but no urls are showing up in the Spider Queue table as being spidered. What is wrong?</b><br><table width=100%><tr><td>1. Set <i>log spidered urls</i> to YES on the <i>log</i> page. Then check the log to see if something is being logged.</td></tr><tr><td>2. Check the <a href=/admin/master>master controls</a> page for the following:<br> a. the <i>spider enabled</i> switch is set to YES.
-->
<!--<br> c. the <i>spider max kbps</i> control is set high enough.</td></tr></td></tr><tr><td>3. Check the <i>spider controls</i> page for the following:-->
<!--<br> c. the <i>spider delay</i> control is not TOO HIGH.-->
<!--
</td></tr></td></tr><tr><td>
3. Check the <a href=/admin/spider>spider controls</a> page for the following:
<br> a. the collection you wish to spider for is selected (in red).
<br> b. the <i>spidering enabled</i> switch is set to YES.
<br> c. the <i>max spiders</i> control is not TOO LOW.
<br> d. the <i>spider delay in milliseconds</i> control is not TOO HIGH.
<br> e. the appropriate <i>spidering enabled</i> checkboxes in the URL Filters page are checked.
-->
<!--<br> c. the <i>spider start</i> and <i>end times</i> are set appropriately.-->
<!--<br> d. the <i>use current time</i> control is set correctly.-->
<!--</td></tr><tr><td>4. Make sure you have urls to spider by running 'gb dump s <collname>' on the command line to dump out spiderdb. See 'gb -h' for the help menu and more options.-->
<!--
</td></tr>
</table>
-->
<!-- If they are mostly "getting cached web page" and the IP address column is mostly empty, then Gigablast may be bogged down looking up the cached web pages of each url in the spider queue only to discover it is from a domain that was just spidered. This is a wasted lookup, and it can bog things down pretty quickly when you are spidering a lot of old urls from the same domain. Try setting <i>same domain wait</i> and <i>same ip wait</i> both to 0. This will pound those domain's server, though, so be careful. Maybe set it to 1000ms or so instead. We plan to fix this in the future.
-->
<!--
<br><br><a name=security></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>The Security System
</td></tr></table>
<br>
Right now any local IP can administer Gigablast, so any IP on the same network with a netmask of 255.255.255.0 can get in. There was an accounting system but it was disabled for simplicity. So we need to at least partially re-enable it, but still keep things simple for single administrators on small networks.
Every request sent to the Gigablast server is assumed to come from one of four types of users: a public user, a spam assassin, a collection admin, or a master admin. A collection admin has control over the controls corresponding to a particular collection. A spam assassin has access to even fewer controls for a particular collection, just enough to remove pages from it. A master admin has control over all aspects and all collections. <br><br>To verify a request is from an admin or spam assassin Gigablast requires that the request contain a password or come from a listed IP. To maintain these lists of passwords and IPs for the master admin, click on the "security" tab. To maintain them for a collection admin or for a spam assassin, click on the "access" tab for that collection. Alternatively, the master passwords and IPs can be edited in the gb.conf file in the working dir and collection admin passwords and IPs can be edited in the coll.conf file in the collections subdirectory in the working dir. <br><br>To add a further layer of security, Gigablast can serve all of its pages through the https interface. By changing http:// to https:// and using the SSL port you specified in hosts.conf, all requests and responses will be made secure.-->
<br><br>
<!--
<a name=build></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Building an Index
</td></tr></table>
<br>
<b>1.</b> Determine a collection name for your index. You may just want to use the default, unnamed collection. Gigablast is capable of handling many sub-indexes, known as collections. Each collection is independent of the other collections. You can add a new collection by clicking on the <b>add new collection</b> link on the <a href="/admin/spider">Spider Controls</a> page.<br><br>
<b>2.</b> Go to the <a href=/admin/settings>settings</a> page and add the sites you want. Only add seeds if you want to spider the whole web.
-->
<!--
<b>2.</b> Add rules to the <a href="/admin/filters">URL Filters page</a>. This is like a routing table but for URLs. The first rule that a URL matches will determine what priority queue it is assigned to. You can also use <a href="http://www.phpbuilder.com/columns/dario19990616.php3">regular expressions</a>. The special keywords you can use are described at the bottom of the rule table.
<br><br>
On that page you can tell Gigablast how often to
re-index a URL in order to pick up any changes to that URL's content.
You can assign a spider priority, the maximum number of outstanding spiders for that rule, the re-spider frequency and how long to wait before spidering another url in that same priority. It would be nifty to have an infile:myfile.txt rule that would match if the URL's subdomain were in that file, myfile.txt. Until that is added, however, you can add your file of subdomains to tagdb and set a tag field, such as <i>ruleset</i>, to 3. Then you can say 'tag:ruleset==3' as one of your rules to capture them. This works because tagdb is hierarchical like that.
<br><br>
-->
<!--<b>3.</b> Test your Regular Expressions. Once you've submitted your
regular expressions try entering some URLs in the second pink box, entitled,
<i>URL Filters Test</i> on the <a href="/admin/filters">URL Filters page</a>. This will help you make sure that you've entered your regular expressions correctly. (NOTE: something happened to this box. It is missing and needs to be put back.)
<br><br>-->
<!--
<b>4.</b> Enable "add url". By enabling the add url interface you will be able to tell Gigablast to index some URLs. You must make sure add url is enabled on the <a href="/admin/master">Master Controls</a> page and also on the <a href="/admin/spider">Spider Controls</a> page for your collection. If it is disabled on the Master Controls page then you will not be able to add URLs for *any* collection.
<br><br>
<b>5.</b> Submit some seed URLs. Go to the <a href="/addurl">add url
page</a> for your collection and submit some URLs you'd like to put in your
index. Usually you want these URLs to have a lot of outgoing links that
point to other pages you would like to have in your index as well. Gigablast's
spiders will follow these links and index whatever web pages they point to,
then whatever pages the links on those pages point to, ad infinitum. But you
must make sure that <b>spider links</b> is enabled on the <a href="/admin/spider">Spider Controls</a> page for your collection.
<br><br>
<b>5.a.</b> Check the spiders. You can go to the <b>Spider Queue</b> page to see what urls are currently being spidered from all collections, as well as see what urls exist in various priority queues, and what urls are cached from various priority queues. If your urls are not being spidered, check to see if they are in the various spider queues. Urls added via the add url interface usually go to priority queue 5 by default, but that may have been changed on the Spider Controls page to another priority queue. And a url may have been added to the priority queue of any host on the network, so you may have to check each one to find it.<br><br>
If you do not see it on any hosts you can do an <b>all just save</b> in the Master Controls on host #0 and then dump spiderdb using gb's command line dumping function, <b>gb dump s 0 -1 1 -1 5</b> (see gb -h for help) on every host in the cluster and grep out the url you added to see if you can find it in spiderdb.<br><br>Then make sure that the spider start and end times on the Spider Controls page encompass the current time, that spidering is enabled, and that spidering is enabled for that priority queue. If all of these check out, the url should be spidered as soon as possible.<br><br>
<b>6.</b> Regulate the Spiders. Given enough hardware, Gigablast can index
millions of pages PER HOUR. If you don't want Gigablast to thrash your or
someone else's website
then you should adjust the time Gigablast waits between page requests to the
same web server. To do this go to the
<a href="/admin/spider">Spider Controls</a> page for your collection and set
the <b>spider delay in milliseconds</b> value to how long you want Gigablast to wait in between page requests. This value is in milliseconds (ms). There are 1000 milliseconds in one second. That is, 1000 ms equals 1 second.
You must then click on the
<i>update</i> button at the bottom of that page to submit your new value. Or just press enter.
<br><br>
<b>7.</b> Turn on the new spider. Go to the
<a href="/admin/spider">Spider Controls</a> page for your collection and
turn on <b>spidering enabled</b>. It should be at the top of the
controls table. You may also have to turn on spidering from the
<a href="/admin/master">Master Controls</a> page which is a master switch for all
collections.
<br><br>
<b>8.</b> Monitor the spider's progress. By visiting the
<a href="/admin/spiderdb">Spider Queue</a> page for your collection you can see what
URLs are currently being indexed in real-time. Gigablast.com currently has 32 hosts and each host spiders different URLs. You can easily switch between
these hosts by clicking on the host numbers at the top of the page.
-->
<!--<br><br><br>-->
<a name=spider></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>The Spider
</td></tr></table>
<br>
<<i>Last Updated Sep 2014</i>>
<br>
<br>
<b>Robots.txt</b>
<br><br>
The name of Gigablast's spider is Gigabot, but by default it sends GigablastOpenSource as its User-Agent when downloading web pages.
Gigabot respects the <a href=/spider.html>robots.txt convention</a> (robot exclusion) and supports the noindex, noarchive and nofollow meta tags. You can tell Gigabot to ignore robots.txt files on the <a href="/admin/spider">Spider Controls</a> page.
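<br><br>
To see how a robots.txt rule aimed at this User-Agent would be evaluated under the robot exclusion convention, here is a minimal sketch using Python's standard <i>urllib.robotparser</i> module. It illustrates the convention only and is not Gigabot's own parser. The sample robots.txt, the example.com urls and the helper name <i>allowed</i> are made up for illustration, and it is an assumption that a site would address the spider by the token GigablastOpenSource.
<pre>
```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the Gigablast User-Agent out of /private/,
# and place no restrictions on any other crawler.
ROBOTS_TXT = """\
User-agent: GigablastOpenSource
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = allowed(ROBOTS_TXT, "GigablastOpenSource",
                  "http://example.com/private/page.html")
permitted = allowed(ROBOTS_TXT, "GigablastOpenSource",
                    "http://example.com/index.html")
print("private page allowed:", blocked)
print("index page allowed:", permitted)
```
</pre>
With this sample file, the page under /private/ is disallowed for GigablastOpenSource while the index page is permitted.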
<br><br>
<a name="spiderqueue"></a>
<b>Spider Queues</b>
<br><br>
You can tell Gigabot what to spider by using the <i>site list</i> on the <a href=/admin/settings>Settings</a> page. For very precise control over the spider, you can also use the <a href=/admin/filters>URL Filters</a> page, which allows you to prioritize and schedule the spiders based on the individual URL and many of its associated attributes, such as hop count, language, parent language, whether it is indexed already and the number of inlinks to its site, to name just a few.
<br><br>
<br>
<!--
<b><a name=classifying>Classifying URLs</a></b>
<br><br>
You can specify different indexing and spider parameters on a per URL basis by one or more of the following methods:
<br><br>
<ul>
<li>Using the <a href="/admin/tagdb">tagdb interface</a>, you can assign a <a href=#ruleset>ruleset</a> to a set of sites. All you do is provide Gigablast with a list of sites and the ruleset to use for those sites.
You can enter the sites via the <a href="/admin/tagdb">HTML form</a> or you can provide Gigablast with a file of the sites. Each file must be limited to 1 Megabyte, but you can add hundreds of millions of sites.
Sites can be full URLs, hostnames, domain names or IP addresses.
If you add a site which is just a canonical domain name with no explicit host name, like gigablast.com, then any URL with the same domain name, regardless of its host name will match that site. That is, "hostname.gigablast.com" will match the site "gigablast.com" and therefore be assigned the associated ruleset.
Sites may also use IP addresses instead of domain names. If the least significant byte of an IP address that you submit to tagdb is 0 then any URL with the same top 3 IP bytes as that IP will be considered a match.
<li>You can specify a regular expression to describe a set of URLs using the interface on the <a href="/admin/filters">URL filters</a> page. You can then assign a <a href=#ruleset>ruleset</a> that describes how to spider those URLs and how to index their content. Currently, you can also explicitly assign a spider frequency and spider queue to matching URLs. If these are specified they will override any values in the ruleset.</ul>
If the URL being spidered matches a site in tagdb then Gigablast will use the corresponding ruleset from that and will not bother searching the regular expressions on the <a href="/admin/filters">URL filters</a> page.
-->
<!--
<br><br>
Gigablast uses spider queues to hold and partition URLs. Each spider queue has an associated priority which ranges from 0 to 127.
Furthermore, each queue is either denoted as <i>old</i> or <i>new</i>. Old spider queues hold URLs whose content is currently in the index. New spider queues hold URLs whose content is not in the index. The priority of a URL is the same as the priority of the spider queue to which it belongs. You can explicitly assign the priority of a URL by specifying it in a <a href=#ruleset>ruleset</a> to which that URL has been assigned or by assigning it on the <a href="/admin/filters">URL filters</a> page.
<br><br>
On the <a href="/admin/spider">Spider Controls</a> page you can toggle the spidering of individual spider queues as well as link harvesting. More control on a per queue basis will be available soon, perhaps including the ability to assign a ruleset to a spider queue.
<br><br>
The general idea behind spider queues is that it allows Gigablast to prioritize its spidering. If two URLs are overdue to be spidered, Gigabot will download the one in the spider queue with the highest priority before downloading the other. If the two URLs have the same spider priority then Gigabot will prefer the one in the new spider queue. If they are both in the new queue or both in the old queue, then Gigabot will spider them based on their scheduled spider time.
<br><br>
Another aspect of the spider queues is that they allow Gigabot to perform depth-first spidering. When no priority is explicitly given for a URL then Gigabot will assign the URL the priority of the "linker from which it was found" minus one.
-->
<!--
<br><br>
<b>Custom Filters</b>
<br><br>
You can write your own filters and hook them into Gigablast. A filter is an executable that takes an HTTP reply as input through stdin and makes adjustments to that input before passing it back out through stdout. The HTTP reply is essentially the reply Gigabot received from a web server when requesting a URL. The HTTP reply consists of an HTTP MIME header followed by the content for the URL.
<br><br>
Gigablast also appends <b>Last-Indexed-Date</b>, <b>Collection</b>, <b>Url</b> and <b>DocId</b> fields to the MIME in order to supply your filter with more information. The Last-Indexed-Date is the time that Gigablast last indexed that URL. It is -1 if the URL's content is currently not in the index.
<br><br>
You can specify the name of your filter (an executable program) on the <a href="/admin/spider">Spider Controls</a> page. After Gigabot downloads a web page it will write the HTTP reply into a temporary file stored in the /tmp directory. Then it will pass the filename as the first argument to the first filter by calling the system() function. popen() was used previously but was found to be buggy under Linux 2.4.17. Your program should send the filtered reply back out through stdout.
<br><br>
You can use multiple filters by using the pipe operator and entering a filter like "./filter1 | ./filter2 | ./filter3". In this case, only "filter1" would receive the temporary filename as its argument, the others would read from stdin.
<br><br>
-->
<!--
<a name=quotas></a>
<b>Document Quotas</b>
<br><br>
You can limit the number of documents on a per site basis. By default the site is defined to be the full hostname of a url, like <i>www.ibm.com</i>. However, using tagdb you can define the site as a domain or even a subfolder within the url. By adjusting the <maxDocs> parameter in the <a href=#ruleset>ruleset</a> for a particular url you can control how many documents are allowed into the index from that site. Additionally, the quotaBoost tables in the same ruleset file allow you to influence how a quota is changed based on the quality of the url being indexed and the quality of its root page. Furthermore, the Spider Controls allow you to turn quota checking on and off for old and new documents. <br><br>The quota checking routine quickly obtains a decent approximation of how many documents a particular site has in the index, but this approximation becomes higher than the actual count as the number of big indexdb files increases, so you may want to keep <indexdbMinFilesToMerge> in <a href=#config>gb.conf</a> down to a value of around five or so to ensure a halfway decent approximation. Typically you can expect to be off by about 1000 to 2000 documents for every indexdb file you have.<br><br>
<br><br>
-->
<!--
<a name=injecting></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Injecting Documents</td></tr></table>
<br>
<<i>Caution: Old Documentation</i>>
<br>
<br>
<b>Injection Methods</b>
<br><br>
Gigablast allows you to inject documents directly into the index by using the command <b>gb [-c <<a href=#hosts>hosts.conf</a>>] <hostId> --inject <file></b> where <file> must be a sequence of HTTP requests as described below. They will be sent to the host with id <hostId>.<br><br>
You can also inject your own content a second way, by using the <a href="/admin/inject">Inject URL</a> page. <br><br>
Thirdly you can use your own program to feed the content directly to Gigablast using the same form parameters as the form on the Inject URL page.<br><br>
<br><br><br>
<b>Input Parameters</b>
<br><br>
When sending an injection HTTP request to a Gigablast server, you may optionally supply an HTTP MIME in addition to the content. This MIME is treated as if Gigablast's spider downloaded the page you are injecting and received that MIME. If you do supply this MIME you must make sure it is HTTP compliant, precedes the actual content and ends with a "\r\n\r\n" followed by the content itself. The smallest mime header you can get away with is "HTTP 200\r\n\r\n" which is just an "OK" reply from an HTTP server.<br><br>
The cgi parameters accepted by the /inject URL for injecting content are the following: (<b>remember to map spaces to +'s, etc.</b>)<br><br>
<table cellpadding=4>
<tr><td bgcolor=#eeeeee><b>u=X</b></td>
<td bgcolor=#eeeeee>X is the url you are injecting. This is required.</td></tr>
<tr><td><b>c=X</b></td>
<td>X is the name of the collection into which you are injecting the content. This is required.</td></tr>
<tr><td bgcolor=#eeeeee><b>delete=X</b></td>
<td bgcolor=#eeeeee>X is 0 to add the URL/content and 1 to delete the URL/content from the index. Default is 0.</td></tr>
<tr><td><b>ip=X</b></td>
<td>X is the ip of the URL (e.g. 1.2.3.4). If this is omitted or invalid then Gigablast will look up the IP, provided <i>iplookups</i> is true. But if <i>iplookups</i> is false, Gigablast will use the default IP of 1.2.3.4.</td></tr>
<tr><td bgcolor=#eeeeee><b>iplookups=X</b></td>
<td bgcolor=#eeeeee>If X is 1 and the ip of the URL is not valid or provided then Gigablast will look it up. If X is 0 Gigablast will never look up the IP of the URL. Default is 1.</td></tr>
-->
<!--<tr><td><b>isnew=X</b></td>
<td>If X is 0 then the URL is presumed to already be in the index. If X is 1 then the URL is presumed to not be in the index. Omitting this parameter is ok for now. In the future it may be put to use to help save disk seeks. Default is 1.</td></tr>-->
<!--
<tr><td><b>dedup=X</b></td>
<td>If X is 1 then Gigablast will not add the URL if another already exists in the index from the same domain with the same content. If X is 0 then Gigablast will not do any deduping. Default is 1.</td></tr>
<tr><td bgcolor=#eeeeee><b>rs=X</b></td>
<td bgcolor=#eeeeee>X is the number of the <a href=#ruleset>ruleset</a> to use to index the URL and its content. It will be auto-determined if <i>rs</i> is omitted or <i>rs</i> is -1.</td></tr>
<tr><td><b>quick=X</b></td>
<td>If X is 1 then the reply returned after the content is injected is the reply described directly below this table. If X is 0 then the reply will be the HTML form interface.</td></tr>
<tr><td bgcolor=#eeeeee><b>hasmime=X</b></td>
<td bgcolor=#eeeeee>X is 1 if the provided content includes a valid HTTP MIME header, 0 otherwise. Default is 0.</td></tr>
<tr><td><b>content=X</b></td>
<td>X is the content for the provided URL. If <i>hasmime</i> is true then the first part of the content is really an HTTP mime header, followed by "\r\n\r\n", and then the actual content.</td></tr>
<tr><td bgcolor=#eeeeee><b>ucontent=X</b></td>
<td bgcolor=#eeeeee>X is the UNencoded content for the provided URL. Use this one <b>instead</b> of the <i>content</i> cgi parameter if you do not want to encode the content. This breaks the HTTP protocol standard, but is convenient because the caller does not have to convert special characters in the document to their corresponding HTTP code sequences. <b>IMPORTANT</b>: this cgi parameter must be the last one in the list.</td></tr>
</table>
<br><br>
<b>Sample Injection Request</b> (line breaks are \r\n):<br>
<pre>
POST /inject HTTP/1.0
Content-Length: 291
Content-Type: text/html
Connection: Close

u=myurl&c=&delete=0&ip=4.5.6.7&iplookups=0&dedup=1&rs=7&quick=1&hasmime=1&ucontent=HTTP 200
Last-Modified: Sun, 06 Nov 1994 08:49:37 GMT
Connection: Close
Content-Type: text/html
</pre>
<i>ucontent</i> is the unencoded content of the page we are injecting. It lets you specify data without URL-encoding it, which is both faster and easier.
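<br><br>
A request like the sample above could also be assembled by a small client program. The following Python sketch is only an illustration under stated assumptions: the function name <i>build_inject_request</i> is hypothetical, the parameter values are placeholders, and the sketch simply mirrors the sample layout (encoded parameters first, raw <i>ucontent</i> last, headers separated from the body by a blank line).

```python
from urllib.parse import urlencode


def build_inject_request(params, ucontent, path="/inject"):
    """Return the raw bytes of an HTTP/1.0 POST for an injection.

    params   : dict of the url-encoded cgi parameters (u, c, quick, ...)
    ucontent : the raw, unencoded document text; it is appended last
               because ucontent must be the final cgi parameter
    """
    # Encode the ordinary parameters, then append ucontent untouched.
    body = urlencode(params) + "&ucontent=" + ucontent
    body_bytes = body.encode("utf-8")
    headers = [
        "POST %s HTTP/1.0" % path,
        "Content-Length: %d" % len(body_bytes),
        "Content-Type: text/html",
        "Connection: Close",
    ]
    # Every header line ends in \r\n and a blank line precedes the body.
    head = "\r\n".join(headers) + "\r\n\r\n"
    return head.encode("ascii") + body_bytes


# Hypothetical example values, not taken from a real collection.
request = build_inject_request(
    {"u": "myurl", "c": "", "quick": 1, "hasmime": 0},
    "<html><body>hello</body></html>",
)
```

Computing <i>Content-Length</i> from the encoded body, rather than hard-coding it, keeps the header consistent with whatever document is injected.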
<br><br>
<b>The Reply</b>
<br><br>
<a name=ireply></a>The reply is always a typical HTTP reply, but if you set <i>quick=1</i> then the content of the HTTP reply to the injection request (everything below the returned MIME header) has the format:<br>
<br>
<X> docId=<Y> hostId=<Z><br>
<br>
OR<br>
<br>
<X> <error message><br>
<br>
Where <X> is a string of ASCII digits giving the error code. X is 0 on success (no error), in which case it is followed by a <b>long long</b> docId and a hostId, which identifies the host in the <a href=#hosts>hosts.conf</a> file that stored the document. Any twins in its group (shard) will also have copies. If there was an error then X will be greater than 0 and may be followed by a space and then the error message itself. If you did not set <i>quick=1</i>, you will get back a response meant to be viewed in a browser.<br>
<br>
Make sure to read the complete reply before spawning another request, lest Gigablast become flooded with requests.<br>
<br>
Example success reply: <b>0 docId=123543 hostId=0</b><br>
Example error reply: <b>12 Cannot allocate memory</b>
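<br><br>
A client could split the two reply shapes described above as follows. This is a minimal Python sketch, not part of Gigablast: the function name <i>parse_quick_reply</i> is hypothetical, and it assumes the reply content has already been separated from the HTTP MIME header.

```python
def parse_quick_reply(content):
    """Parse the content of a quick=1 injection reply.

    Returns (error_code, info). On success (code 0) info is a dict with
    integer docId and hostId. On error info is the error message, which
    may be empty because the message is optional.
    """
    parts = content.strip().split(" ", 1)
    code = int(parts[0])
    rest = parts[1] if len(parts) > 1 else ""
    if code != 0:
        return code, rest
    # Success replies carry space-separated name=value pairs.
    fields = dict(item.split("=", 1) for item in rest.split())
    return 0, {"docId": int(fields["docId"]), "hostId": int(fields["hostId"])}


ok = parse_quick_reply("0 docId=123543 hostId=0")
err = parse_quick_reply("12 Cannot allocate memory")
```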
<br>
<br>
See the <a href=#errors>Error Codes</a> for all errors, but the following
errors are most likely:<br>
<table cellpadding=2>
<tr><td><b> 12 Cannot allocate memory</b></td><td>There was a shortage of memory to properly process the request.</td></tr>
<tr><td><b>32771 Record not found</b></td><td>A cached page was not found when it should have been, likely due to corrupt data on disk.</td></tr>
<tr><td><b>32769 Try doing it again</b></td><td>There was a shortage of resources so the request should be repeated.</td></tr>
<tr><td><b>32863 No collection record</b></td><td>The injection was to a collection that does not exist.</td></tr>
</table>
<br>
<br><br>
<a name=deleting></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Deleting Documents</td></tr></table>
<br>
<<i>Caution: Old Documentation</i>>
<br>
<br>
You can delete documents from the index in two ways:<ul>
<li>Perhaps the most popular is to use the <a href="/admin/reindex">Reindex URLs</a> tool which allows you to delete all documents that match a simple query. Furthermore, that tool allows you to assign rulesets to all the domains of all the matching documents. All documents that match the query will have their docids stored in a spider queue of a user-specified priority. The spider will have to be enabled for that priority queue for the deletion to take place. Deleting documents is very similar to adding documents.<br><br>
<li>To delete a single document you can use the <a href="/admin/inject">Inject URL</a> page.
</ul>
-->
<!--
<br>
<a name=metas></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Indexing User-Defined Meta Tags</td></tr></table>
<br>
<<i>Caution: Old Documentation</i>>
<br>
<br>
Gigablast supports the indexing, searching and displaying of user-defined meta tags. For instance, if you have a tag like <i><meta name="foo" content="bar baz"></i> in your document, then you will be able to do a search like <i><a href="/search?q=foo%3Abar&dt=foo">foo:bar</a></i> or <i><a href="/search?q=foo%3A%22bar+baz%22&dt=foo">foo:"bar baz"</a></i> and Gigablast will find your document. <br><br>
You can tell Gigablast to display the contents of arbitrary meta tags in the search results, like <a href="/search?q=gigablast&s=10&dt=author+keywords%3A32">this</a>. Note that you must assign the <i>dt</i> cgi parameter to a space-separated list of the names of the meta tags you want to display. You can limit the number of returned characters of each tag to X characters by appending a <i>:X</i> to the name of the meta tag supplied to the <i>dt</i> parameter. In the link above, I limited the displayed keywords to 32 characters. The content of the meta tags is also provided in the <display> tags in the <a href="#output">XML feed</a>.
<br><br>
Gigablast will index the content of all meta tags in this manner. Meta tags with the same <i>name</i> parameter as other meta tags in the same document will be indexed as well.
<br><br>
Why use user-defined metas? Because they are very powerful. They allow you to embed custom data in your documents, search for it and retrieve it.<br>
<br>
You can also explicitly specify how to index certain meta tags by making an <index> tag in the <a href="#ruleset">ruleset</a> as shown <a href="#rsmetas">here</a>. The specified meta tags will be indexed in the user-defined meta tag fashion as described above, in addition to any method described in the ruleset.<br>
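<br>
The search links shown earlier in this section can be generated rather than typed by hand. The Python sketch below is an assumption-labeled illustration: the function name <i>meta_search_path</i> is hypothetical, and the <i>s</i> parameter is simply copied from the example link because its meaning is not described in this section.

```python
from urllib.parse import urlencode


def meta_search_path(query, display_tags, s=10):
    """Build a /search path that displays the given meta tags.

    display_tags maps a meta tag name to a character limit, or to None
    for no limit. A limit is appended to the name as ":X".
    s is copied from the example link in the text above.
    """
    names = [
        name if limit is None else "%s:%d" % (name, limit)
        for name, limit in display_tags.items()
    ]
    # dt is a space-separated list; urlencode turns spaces into "+"
    # and ":" into "%3A".
    return "/search?" + urlencode({"q": query, "s": s, "dt": " ".join(names)})


path = meta_search_path("gigablast", {"author": None, "keywords": 32})
```

With these inputs the result matches the example link above: the author tag is displayed in full and keywords are limited to 32 characters.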
<br>
<br>
-->
<!--
<a name=bigdocs></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Indexing Big Documents</td></tr></table>
<br>
<<i>Caution: Old Documentation</i>>
<br>
<br>
When indexing a document you are bound by the available memory of the machine that is doing the indexing. Processing a document that is dense in words can take as much as ten times the document's size in memory. Therefore you need to make sure that the amount of available memory is adequate to process the document you want to index. You can turn off Spam detection to reduce the processing overhead a little.<br>
<br>
The <b><maxMem></b> tag in the <a href=#config>gb.conf</a> file controls the maximum amount of memory that the whole Gigablast process can use. HOWEVER, this memory is shared by databases, thread stacks, protocol stacks and other things that may or may not use most of it. Probably the best way to see how much memory is available to the Gigablast process for processing a big document is to look at the <b>Stats Page</b>. It shows exactly how much memory is being used at the time you look at it. Hit refresh to see it change.<br>
<br>
You can also check all the tags in the gb.conf file that have the word "mem" in them to see where memory is being allocated. In addition, you will need to check the first 100 lines of the log file for the gigablast process to see how much memory is being used for thread and protocol stacks. These should be displayed on the Stats page, but are currently not.<br>
<br>
After ensuring you have enough extra memory to handle the document size, you will need to make sure the document fits into the tree that is used to hold the documents in memory before they get dumped to disk. The documents are compressed using zlib before being added to the tree, so you might expect 5:1 compression for a typical web page. The memory used to hold documents in this tree is controlled by the <b><titledbMaxTreeMem></b> parameter in the gb.conf file. Make sure it is big enough to hold the document you would like to add. If the tree could accommodate the big document but is partially full at the time, Gigablast will automatically dump the tree to disk and keep trying to add the big document.<br>
<br>
Finally, you need to ensure that the <b>max text doc len</b> and <b>max other doc len</b> controls on the <b>Spider Controls</b> page are set large enough. Use -1 to indicate no maximum. <i>Other</i> documents are non-text and non-html documents, such as PDFs. These controls will physically prohibit the spider from downloading more than this many bytes, which causes excessively long documents to be truncated. If the spider is downloading a PDF that gets truncated then it abandons it, because truncated PDFs are useless.<br>
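<br>
The two rules of thumb in this section (up to ten times the document size to process it, and roughly 5:1 zlib compression in the tree) can be turned into a quick back-of-the-envelope check. This Python sketch is only an estimate under those quoted figures; the function name <i>big_doc_estimates</i> is hypothetical and real usage will vary with the document.

```python
def big_doc_estimates(doc_bytes, processing_factor=10, compression_ratio=5):
    """Rough sizes for indexing one document.

    processing_factor : worst-case memory multiple quoted in the text (10x)
    compression_ratio : typical zlib ratio quoted in the text (5:1)
    """
    return {
        # Memory that may be needed while the document is processed.
        "processing_mem_bytes": doc_bytes * processing_factor,
        # Approximate compressed size the document occupies in the tree.
        "tree_bytes": doc_bytes // compression_ratio,
    }


estimate = big_doc_estimates(50_000_000)
```

Compare the first figure against the free memory shown on the Stats Page and the second against <titledbMaxTreeMem>.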
<br>
<br>
-->
<!--
<a name=rolling></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Rolling the New Index</td></tr></table>
<br>
Just because you have indexed a lot of pages does not mean those pages are being searched. If the <b>restrict indexdb for queries</b> switch on the <a href="/admin/spider">Spider Controls</a> page is on for your collection then any query you do may not be searching some of the more recently indexed data. You have two options:<br><br>
<b>1.</b> You can turn this switch off, which tells Gigablast to search all the files in the index and gives you a realtime search. However, if <indexdbMinFilesToMerge> is set to <i>X</i> in the <a href=#config>gb.conf</a> file, then Gigablast may have to search X files for every query term. So if X is 40 this can destroy your performance. But high X values are indeed useful for speeding up the build time. Typically, I set X to 4 on gigablast.com, but for doing initial builds I will set it to 40.<br><br>
<b>2.</b> The second option you have for making the newer data searchable is to do a <i>tight merge</i> of indexdb. This tells Gigablast to combine the X files into one. Tight merges typically take about 2-4 minutes for every gigabyte of data that is merged. So if your indexdb* files total about 50 gigabytes, plan on waiting about 150 minutes for the merge to complete.<br><br>
<b>IMPORTANT</b>: Before you do the tight merge you should do a <b>disk dump</b> which tells Gigablast to dump all data in memory to disk so that it can be merged. In this way you ensure your final merged file will contain *all* your data. You may have to wait a while for the disk dump to complete because it may have to do some merging right after the dump to keep the number of files below <indexdbMinFilesToMerge>.<br><br>
Now if you are <a href=#input>interfacing to Gigablast</a> from another program you can use the <b>&rt=[0|1]</b> real time search cgi parameter. If you set this to 0 then Gigablast will only search the first file in the index, otherwise it will search all files.<br><br>
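The tight merge estimate above is simple arithmetic on the quoted rate of 2-4 minutes per gigabyte. As a rough Python sketch (the function name <i>tight_merge_minutes</i> is hypothetical, and the rate is only the rule of thumb given in the text):

```python
def tight_merge_minutes(indexdb_gigabytes, min_per_gb=2, max_per_gb=4):
    """Return (low, midpoint, high) estimates in minutes for a tight merge."""
    low = indexdb_gigabytes * min_per_gb
    high = indexdb_gigabytes * max_per_gb
    return low, (low + high) / 2, high


low, mid, high = tight_merge_minutes(50)
```

For 50 gigabytes this gives a range of 100 to 200 minutes, with the midpoint matching the 150 minutes quoted above.<br><br>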
-->
<a name=dmoz></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Building a DMOZ Based Directory</td></tr></table>
<br>
<<i>Last Updated Jan 23, 2016</i>>
<br>
<<i>Procedure tested on 32-bit Gigablast on Ubuntu 14.04 on Nov 18, 2014</i>>
<br>
<<i>Procedure tested on 64-bit Gigablast on Ubuntu 14.04 on Jan 23, 2016</i>>
<br>
<br>
<b>Building the DMOZ Directory:</b>
<br><ul><li>Create the <i>dmozparse</i> program.<br> <b>$ make dmozparse</b><br>
<br>
<li>Download the latest content.rdf.u8 and structure.rdf.u8 files from http://rdf.dmoz.org/rdf into the <i>catdb/</i> directory on host 0, the first host listed in the <a href=/hosts.conf.txt>hosts.conf</a> file.
<b>
<br> $ mkdir catdb
<br> $ cd catdb
<br> $ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
<br> $ gunzip content.rdf.u8.gz
<br> $ wget http://rdf.dmoz.org/rdf/structure.rdf.u8.gz
<br> $ gunzip structure.rdf.u8.gz</b>