objective c: Clustered/nonclustered primary key dilemma

Suppose we had to define optimal indexing for Stack Overflow posts. Rather than take the actual schema of the Posts table, let's include only the columns that are relevant here:

create table Posts (
    Id int not null
        identity,
    PostType tinyint not null,
    CreationDate datetime not null
        default getdate(),
    LastActivityDate datetime not null
        default getdate(),
    Title nvarchar(500) null, -- answers don't have titles
    Body nvarchar(max) not null,
    ...
)

I've made Id an identity column even though Data Stackexchange shows that none of the tables has a primary key constraint or an identity column. They just have many unique/non-unique, clustered/nonclustered indices.

Usage scenarios

So basically there are two main scenarios for posts:

1. They're displayed chronologically, in descending order by their LastActivityDate column (or maybe LastEditDate, which I haven't included above as it's not so important)
2. They're displayed individually on the question details page
   - Answers are displayed on the question details page in votes order (the ScoreCount column is not part of the code above)
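
To make the scenarios concrete, here is a rough sketch of the queries they would translate to. It assumes that PostType = 1 marks a question and PostType = 2 an answer, that ParentId and ScoreCount are among the columns omitted from the table above, and that SQL Server 2012 or later is available for offset/fetch paging; the variable names are hypothetical.

```sql
-- Scenario 1: newest activity first, one page at a time.
-- Assumption: PostType = 1 is a question; @PageNumber/@PageSize are made up.
declare @PageNumber int = 1, @PageSize int = 50;

select Id, Title, LastActivityDate
from Posts
where PostType = 1
order by LastActivityDate desc
offset (@PageNumber - 1) * @PageSize rows
fetch next @PageSize rows only;

-- Scenario 2: a single question and its answers in votes order.
-- Assumption: PostType = 2 is an answer and ParentId points at the question.
declare @QuestionId int = 42;

select Id, Title, Body
from Posts
where Id = @QuestionId;

select Id, Body, ScoreCount
from Posts
where ParentId = @QuestionId
  and PostType = 2
order by ScoreCount desc;
```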

Indexing optimization

Which indices would be best for the above scenarios, especially if we say that #1 is the most common one and therefore has to work really fast?

I'd say that one of the better possibilities would be to create these indices:

-- index 1
alter table Posts
add primary key nonclustered (Id);

-- index 2
create clustered index IX_Posts_LastActivityDate
on Posts(LastActivityDate desc);

-- index 3 (ParentId and ScoreCount are among the columns omitted above)
create index IX_Posts_ParentId
on Posts(ParentId, PostType)
include (ScoreCount);

This way we basically end up with three indices of which the second one is clustered.

To make scenario #1 really fast I've put the clustered index on the LastActivityDate column, because clustered indices are especially good for range comparisons. Questions would be ordered chronologically from newest to oldest, hence I've set the sort direction to descending and also included the type in the clustered index.

So what did we solve with this?

scenario #1 is covered very efficiently by index 2, as it's clustered and therefore contains every column; we can also do result paging easily and efficiently;
scenario #2 is somewhat covered by unique index 1 (to get the question) and non-unique index 3 (to get all related answers ordered by ScoreCount); and if we decide to order answers chronologically, that's also covered by index 2.

Question 1

SQL Server internals are such that it implicitly adds the clustering key to every nonclustered index so it can locate records in the row store:

if the clustered index is unique, then its key is what gets added to the nonclustered indices, and
if the clustered index is non-unique, SQL Server supposedly generates its own unique value (a uniquifier) and uses that

Since I've also added a nonclustered primary key on the table (which by design must be unique), I would like to know whether SQL Server will still supply its own unique value for the non-unique clustered index, or whether it will use the nonclustered primary key to uniquely identify each record instead.
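
One way to at least see what is declared on the table is to query the catalog views. This is only a sketch, assuming the table lives in the dbo schema. As far as I know, sys.index_columns lists only the explicitly declared key and included columns, so the implicitly added clustering key and any generated unique value will not show up here.

```sql
-- Lists each index on Posts with its declared key and included columns.
-- Assumption: the table is dbo.Posts.
select i.name as index_name,
       i.type_desc,
       i.is_unique,
       i.is_primary_key,
       c.name as column_name,
       ic.key_ordinal,
       ic.is_descending_key,
       ic.is_included_column
from sys.indexes as i
join sys.index_columns as ic
    on ic.object_id = i.object_id
   and ic.index_id = i.index_id
join sys.columns as c
    on c.object_id = ic.object_id
   and c.column_id = ic.column_id
where i.object_id = object_id(N'dbo.Posts')
order by i.index_id, ic.is_included_column, ic.key_ordinal;
```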

Question 2

So if the primary key isn't used to locate records in the row store (clustered index), does it even make sense to create a PK at all? Would it be better in this case to do this instead?

create unique index UX_Posts_Id
on Posts(Id);
-- include (Title, Body, ScoreCount);

It would be great to also include the commented-out columns, but that would make the index inefficient because it would cache worse. The reason I'm asking whether this index would be better than a primary key constraint is that we can add non-key columns to it, while we can't do the same with a PK constraint, which internally generates a unique index.

Question 3

I'm aware that LastActivityDate changes, which isn't desirable for a clustered index key. But we have to consider that this column is likely to change only for some time before it becomes more or less static, so it shouldn't cause too much index fragmentation: whenever LastActivityDate changes, the record mostly moves to the end of the index. Fragmentation would never be caused by a new record being inserted into some old(er) page. Most modifications will happen on the last page.

So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key:

it's not unique - although one could argue about this, especially if we'd change datetime to datetime2 and use higher precision function sysdatetime()
it's narrow - pretty much
it's not static - but I've explained how it changes
it's ever increasing
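
Rather than guess at the fragmentation, it could be measured after a representative workload. A minimal sketch, assuming the table is dbo.Posts and that the cheaper 'LIMITED' scan mode is accurate enough:

```sql
-- Reports fragmentation per index on Posts in the current database.
-- Assumption: the table is dbo.Posts; 'LIMITED' is the fastest scan mode.
select i.name as index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
from sys.dm_db_index_physical_stats(
         db_id(), object_id(N'dbo.Posts'), null, null, 'LIMITED') as ps
join sys.indexes as i
    on i.object_id = ps.object_id
   and i.index_id = ps.index_id
order by ps.avg_fragmentation_in_percent desc;
```

Comparing the numbers for the clustered index before and after a batch of LastActivityDate updates would show whether the "mostly appended to the end" assumption holds.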

Friday, 31 July 2015