@@ -341,3 +341,214 @@ Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
341
341
good, you'll have to ram them down people's throats." -- Howard Aiken
342
342
343
343
344
+ From [email protected] Tue Oct 19 10:31:10 1999
345
+ Received: from renoir.op.net ([email protected] [209.152.193.4])
346
+ by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087
347
+ for <[email protected]>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT)
348
+ Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.2 $) with ESMTP id KAA27535 for <[email protected]>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
349
+ Received: from localhost (majordom@localhost)
350
+ by hub.org (8.9.3/8.9.3) with SMTP id KAA30328;
351
+ Tue, 19 Oct 1999 10:12:10 -0400 (EDT)
352
+ (envelope-from owner-pgsql-hackers)
353
+ Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 10:11:55 -0400
354
+ Received: (from majordom@localhost)
355
+ by hub.org (8.9.3/8.9.3) id KAA30030
356
+ for pgsql-hackers-outgoing; Tue, 19 Oct 1999 10:11:00 -0400 (EDT)
357
+
358
+ Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
359
+ by hub.org (8.9.3/8.9.3) with ESMTP id KAA29914
360
+ for <[email protected]>; Tue, 19 Oct 1999 10:10:33 -0400 (EDT)
361
+
362
+ Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
363
+ by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id KAA09038;
364
+ Tue, 19 Oct 1999 10:09:15 -0400 (EDT)
365
+ To: "Hiroshi Inoue" <[email protected]>
366
+
367
+ Subject: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
368
+ In-reply-to: Your message of Tue, 19 Oct 1999 19:03:22 +0900
369
+
370
+ Date: Tue, 19 Oct 1999 10:09:15 -0400
371
+
372
+ From: Tom Lane <[email protected]>
373
+
374
+ Status: OR
375
+
376
+ "Hiroshi Inoue" <[email protected]> writes:
377
+ > 1. shared cache holds committed system tuples.
378
+ > 2. private cache holds uncommitted system tuples.
379
+ > 3. relpages of shared cache are updated immediately by
380
+ > physical change and corresponding buffer pages are
381
+ > marked dirty.
382
+ > 4. on commit, the contents of uncommitted tuples except
383
+ > relpages, reltuples, ... are copied to corresponding tuples
384
+ > in shared cache and the combined contents are
385
+ > committed.
386
+ > If so, catalog cache invalidation would no longer be needed.
387
+ > But synchronizing step 4 may be difficult.
388
+
389
+ I think the main problem is that relpages and reltuples shouldn't
390
+ be kept in pg_class columns at all, because they need to have
391
+ very different update behavior from the other pg_class columns.
392
+
393
+ The rest of pg_class is update-on-commit, and we can lock down any one
394
+ row in the normal MVCC way (if transaction A has modified a row and
395
+ transaction B also wants to modify it, B waits for A to commit or abort,
396
+ so it can know which version of the row to start from). Furthermore,
397
+ there can legitimately be several different values of a row in use in
398
+ different places: the latest committed, an uncommitted modification, and
399
+ one or more old values that are still being used by active transactions
400
+ because they were current when those transactions started. (BTW, the
401
+ present relcache is pretty bad about maintaining pure MVCC transaction
402
+ semantics like this, but it seems clear to me that that's the direction
403
+ we want to go in.)
404
+
405
+ relpages cannot operate this way. To be useful for avoiding lseeks,
406
+ relpages *must* change exactly when the physical file changes. It
407
+ matters not at all whether the particular transaction that extended the
408
+ file ultimately commits or not. Moreover there can be only one correct
409
+ value (per relation) across the whole system, because there is only one
410
+ length of the relation file.
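
The relpages behavior described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from md.c: the names SharedRelSize, rel_extend, rel_nblocks_cached and rel_nblocks_seek are hypothetical, and a stdio stream stands in for the relation file. The point is that the cached block count changes in the same step as the physical file and does not depend on whether the extending transaction later commits.

```c
#include <assert.h>
#include <stdio.h>

#define BLCKSZ 8192 /* assumed block size for this sketch */

/* Hypothetical system-wide state for one relation file (not a real
 * PostgreSQL structure). There is exactly one copy per relation. */
typedef struct SharedRelSize {
    FILE *file;       /* stands in for the relation's physical file */
    unsigned nblocks; /* cached length in blocks, shared by everyone */
} SharedRelSize;

/* Append one zero-filled block and bump the cached count in the same
 * step. Commit or abort of the calling transaction plays no role. */
static int rel_extend(SharedRelSize *rel)
{
    static const char zeros[BLCKSZ] = {0};

    assert(rel != NULL && rel->file != NULL);
    if (fseek(rel->file, (long) rel->nblocks * BLCKSZ, SEEK_SET) != 0)
        return -1;
    if (fwrite(zeros, 1, BLCKSZ, rel->file) != BLCKSZ)
        return -1;
    rel->nblocks++;
    return 0;
}

/* Fast path: no system call, just read the cached value. */
static unsigned rel_nblocks_cached(const SharedRelSize *rel)
{
    assert(rel != NULL);
    return rel->nblocks;
}

/* Slow path: seek to the end of the file to learn its length, which
 * is the per-call cost the cached count is meant to avoid. */
static long rel_nblocks_seek(SharedRelSize *rel)
{
    long end;

    assert(rel != NULL && rel->file != NULL);
    if (fflush(rel->file) != 0 || fseek(rel->file, 0L, SEEK_END) != 0)
        return -1;
    end = ftell(rel->file);
    return end < 0 ? -1 : end / BLCKSZ;
}
```

In a real multi-backend server the counter would have to live in shared memory and be updated under whatever lock protects file extension; the sketch leaves that out.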
411
+
412
+ If we want to take reltuples seriously and try to maintain it
413
+ on-the-fly, then I think it needs still a third behavior. Clearly
414
+ it cannot be updated using MVCC rules, or we lose all writer
415
+ concurrency (if A has added tuples to a rel, B would have to wait
416
+ for A to commit before it could update reltuples...). Furthermore
417
+ "updating" isn't a simple matter of storing what you think the new
418
+ value is; otherwise two transactions adding tuples in parallel would
419
+ leave the wrong answer after B commits and overwrites A's value.
420
+ I think it would work for each transaction to keep track of a net delta
421
+ in reltuples for each table it's changed (total tuples added less total
422
+ tuples deleted), and then atomically add that value to the table's
423
+ shared reltuples counter during commit. But that still leaves the
424
+ problem of how you use the counter during a transaction to get an
425
+ accurate answer to the question "If I scan this table now, how many tuples
426
+ will I see?" At the time the question is asked, the current shared
427
+ counter value might include the effects of transactions that have
428
+ committed since your transaction started, and therefore are not visible
429
+ under MVCC rules. I think getting the correct answer would involve
430
+ making an instantaneous copy of the current counter at the start of
431
+ your xact, and then adding your own private net-uncommitted-delta to
432
+ the saved shared counter value when asked the question. This doesn't
433
+ look real practical --- you'd have to save the reltuples counts of
434
+ *all* tables in the database at the start of each xact, on the off
435
+ chance that you might need them. Ugh. Perhaps someone has a better
436
+ idea. In any case, reltuples clearly needs different mechanisms than
437
+ the ordinary fields in pg_class do, because updating it will be a
438
+ performance bottleneck otherwise.
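
The net-delta scheme described above can be sketched in C as follows. It is a minimal illustration assuming C11 atomics; SharedRelTuples, XactRelTuples and the xact_* functions are hypothetical names, not PostgreSQL APIs. It tracks a single table only, so it sidesteps the objection raised above that a real implementation would have to snapshot the counters of all tables at transaction start.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared tuple counter for one table. */
typedef struct SharedRelTuples {
    atomic_long reltuples;
} SharedRelTuples;

/* Hypothetical per-transaction bookkeeping for the same table. */
typedef struct XactRelTuples {
    SharedRelTuples *shared;
    long snapshot;  /* copy of the shared counter taken at xact start */
    long net_delta; /* tuples added minus tuples deleted by this xact */
} XactRelTuples;

static void xact_begin(XactRelTuples *x, SharedRelTuples *s)
{
    x->shared = s;
    x->snapshot = atomic_load(&s->reltuples);
    x->net_delta = 0;
}

static void xact_note_insert(XactRelTuples *x, long n)
{
    assert(n >= 0);
    x->net_delta += n;
}

static void xact_note_delete(XactRelTuples *x, long n)
{
    assert(n >= 0);
    x->net_delta -= n;
}

/* "If I scan this table now, how many tuples will I see?"
 * Uses the saved snapshot, so later commits by others are ignored. */
static long xact_visible_tuples(const XactRelTuples *x)
{
    return x->snapshot + x->net_delta;
}

/* Commit: atomically add the delta rather than storing a new value,
 * so parallel writers do not overwrite each other's changes. */
static void xact_commit(XactRelTuples *x)
{
    atomic_fetch_add(&x->shared->reltuples, x->net_delta);
    x->net_delta = 0;
}

/* Abort: the shared counter was never touched, so just drop the delta. */
static void xact_abort(XactRelTuples *x)
{
    x->net_delta = 0;
}
```

For example, with a shared count of 100, if A adds 10 tuples and B adds 5 and deletes 2, then after A commits the shared counter is 110 while B still computes 103 for its own view, and after B commits the shared counter is 113.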
439
+
440
+ If we allow reltuples to be updated only by vacuum-like events, as
441
+ it is now, then I think keeping it in pg_class is still OK.
442
+
443
+ In short, it seems clear to me that relpages should be removed from
444
+ pg_class and kept somewhere else if we want to make it more reliable
445
+ than it is now, and the same for reltuples (but reltuples doesn't
446
+ behave the same as relpages, and probably ought to be handled
447
+ differently).
448
+
449
+ regards, tom lane
450
+
451
+ ************
452
+
453
+ From [email protected] Tue Oct 19 21:25:30 1999
454
+ Received: from renoir.op.net ([email protected] [209.152.193.4])
455
+ by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130
456
+ for <[email protected]>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT)
457
+ Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.2 $) with ESMTP id VAA10512 for <[email protected]>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
458
+ Received: from localhost (majordom@localhost)
459
+ by hub.org (8.9.3/8.9.3) with SMTP id VAA50745;
460
+ Tue, 19 Oct 1999 21:07:23 -0400 (EDT)
461
+ (envelope-from owner-pgsql-hackers)
462
+ Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 21:07:01 -0400
463
+ Received: (from majordom@localhost)
464
+ by hub.org (8.9.3/8.9.3) id VAA50644
465
+ for pgsql-hackers-outgoing; Tue, 19 Oct 1999 21:06:06 -0400 (EDT)
466
+
467
+ Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
468
+ by hub.org (8.9.3/8.9.3) with ESMTP id VAA50584
469
+ for <[email protected]>; Tue, 19 Oct 1999 21:05:26 -0400 (EDT)
470
+
471
+ Received: from cadzone ([126.0.1.40] (may be forged))
472
+ by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
473
+ id KAA01715; Wed, 20 Oct 1999 10:05:14 +0900
474
+ From: "Hiroshi Inoue" <[email protected]>
475
+ To: "Tom Lane" <[email protected]>
476
+
477
+ Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations
478
+ Date: Wed, 20 Oct 1999 10:09:13 +0900
479
+
480
+ MIME-Version: 1.0
481
+ Content-Type: text/plain;
482
+ charset="iso-8859-1"
483
+ Content-Transfer-Encoding: 7bit
484
+ X-Priority: 3 (Normal)
485
+ X-MSMail-Priority: Normal
486
+ X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
487
+ X-Mimeole: Produced By Microsoft MimeOLE V4.72.2106.4
488
+ Importance: Normal
489
+
490
+ Status: ORr
491
+
492
+ > -----Original Message-----
493
+ > From: Hiroshi Inoue [mailto:[email protected]]
494
+ > Sent: Tuesday, October 19, 1999 6:45 PM
495
+ > To: Tom Lane
496
+
497
+ > Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge
498
+ > relations
499
+ >
500
+ >
501
+ > >
502
+ > > "Hiroshi Inoue" <[email protected]> writes:
503
+ >
504
+ > [snip]
505
+ >
506
+ > >
507
+ > > > Deletion is necessary only not to consume disk space.
508
+ > > >
509
+ > > > For example vacuum could remove not deleted files.
510
+ > >
511
+ > > Hmm ... interesting idea ... but I can hear the complaints
512
+ > > from users already...
513
+ > >
514
+ >
515
+ > My idea is only an analogy of PostgreSQL's simple recovery
516
+ > mechanism of tuples.
517
+ >
518
+ > And my main point is
519
+ > "delete fails after commit" doesn't harm the database
520
+ > except that not deleted files consume disk space.
521
+ >
522
+ > Of course, it's preferable to delete relation files immediately
523
+ > after (or just when) commit.
524
+ > Useless files are visible though useless tuples are invisible.
525
+ >
526
+
527
+ Anyway I don't need "DROP TABLE inside transactions" now
528
+ and my idea is originally for that issue.
529
+
530
+ After some thought, I propose the following solution.
531
+
532
+ 1. mdcreate() couldn't create existent relation files.
533
+ If the existent file is of length zero, we would overwrite
534
+ the file. (It seems the comment in md.c says so, but the
535
+ code doesn't do so).
536
+ If the file is an Index relation file, we would overwrite
537
+ the file.
538
+
539
+ 2. mdunlink() couldn't unlink non-existent relation files.
540
+ mdunlink() doesn't call elog(ERROR) even if the file
541
+ doesn't exist, though I couldn't find where to change
542
+ now.
543
+ mdopen() doesn't call elog(ERROR) even if the file
544
+ doesn't exist and leaves the relation as CLOSED.
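
The two proposed rules for creating and unlinking files can be sketched with plain POSIX calls roughly as follows. This is only one possible reading of the proposal, not the actual mdcreate() or mdunlink() code: rel_create and rel_unlink are hypothetical names, a return value of -1 stands in for elog(ERROR), and the mdopen() part (leaving the relation CLOSED) is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for the proposed mdcreate() rule.
 * Returns an open file descriptor, or -1 on failure. */
static int rel_create(const char *path, int is_index)
{
    struct stat st;
    int fd;

    assert(path != NULL);

    /* Normal case: the file does not exist yet. */
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd >= 0 || errno != EEXIST)
        return fd;

    /* The file already exists: reuse it only if it is empty or
     * belongs to an index, as the proposal suggests. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size != 0 && !is_index) {
        close(fd);
        errno = EEXIST; /* non-empty heap file: refuse to overwrite */
        return -1;
    }
    if (ftruncate(fd, 0) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical stand-in for the proposed mdunlink() rule: a file
 * that is already gone is not treated as an error. */
static int rel_unlink(const char *path)
{
    assert(path != NULL);
    if (unlink(path) == 0 || errno == ENOENT)
        return 0;
    return -1;
}
```

With these rules a leftover zero-length file from a failed create, or a file already removed by an earlier attempt, no longer blocks a later create or drop.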
545
+
546
+ Comments ?
547
+
548
+ Regards.
549
+
550
+ Hiroshi Inoue
551
+
552
+
553
+ ************
554
+