Suppose we'd have to define optimal indexing for Stackoverflow questions. But let's not take the schema of the actual Posts table, let's just include those columns that are actually relevant:
create table Posts (
Id int not null
identity,
PostType tinyint not null,
CreationDate datetime not null
default getdate(),
LastActivityDate datetime not null
default getdate(),
Title nvarchar(500) null, -- answers don't have titles
Body nvarchar(max) not null,
...
)
I've added Id to be identity even though Data Stackexchange shows that none of the tables has a primary key constraint on them, nor identity columns. There are many just unique/non-unique clustered/non-clustered indices.
Usage scenarios
So basically two main scenarios for posts:
- They're chronologically displayed in descending order by their
LastActivityDatecolumn (or maybeLastEditDatethat I haven't included above as it's not so important) - They're individually displayed on question details
- Answers are displayed on question details page in votes order (
ScoreCountcolumn not part of my upper code)
Indexing optimization
Which indices would be best created on above scenarios especially if we'd say that #1 is the most common scenario so it has to work really fast.
I'd say that one of the better possibilities would be to create these indices:
-- index 1
alter table Posts
add primary key nonclustered (Id);
-- index 2
create clustered index IX_Posts_LastActivityDate
on Posts(LastActivityDate desc);
-- index 3
create index IX_Posts_ParentId
on Posts(ParentId, PostTypeId)
include (ScoreCount);
This way we basically end up with three indices of which the second one is clustered.
So in order for #1 to work really fast I've set clustered index on LastActivityDate column, because clustered indices are especially great when we do range comparison on them. And we would be ordering questions chronologically newest to oldest hence I've set ordering direction and also included type on the clustered index.
So what did we solve with this?
- scenario #1 is very efficiently covered by index 2 as it's clustered and fully covered; we can also easily and efficiently do result paging;
- scenario #2 is somewhat covered with unique index 1 (to get the question) and non-unique index 3 to get all related answers ordered by
ScoreCount; and if we decide to chronologically order answers that's also covered with index 2;
Question 1
SQL internals are such that SQL implicitly adds clustered key to nonclustering index so it can locate records in the row store.
- if clustering index is unique, than that's the key that will be added to nonclustering indices, and
- if clustering index is non-unique, SQL supposedly generates its own
UniqueIdand uses that
Since I've also added a nonclustered primary key on the table (which must by design be unique), I would like to know whether SQL will still supply its own unique key on clustered non-unique index or will it use nonclustered primary key to uniquely identify each records instead?
Question 2
So if primary key isn't used to locate records on row store (clustered index) does it even make sense to actually create a PK? Would in this case be better to rather do this?
create unique index UX_Posts_Id
on Posts(Id);
-- include (Title, Body, ScoreCount);
It would be great to also include commented out columns, but then that would make this index inefficient as it will be worse in caching... Why I'm asking whether it would be better to create this index instead of a primary key constraint is because we can include additional non-key columns to this index while we can't do the same when we add a PK constraint that internally generates a unique index...
Question 3
I'm aware that LastActivityDate changes which isn't desired with clustered indices, but we have to consider the fact that this column is more likely to change for some time before it becomes more or less static, so it shouldn't cause too much index fragmentation as records will mostly be appended to the end whenever LastActivityDate changes. Index fragmentation would never happen because some new record would be inserted into some old(er) page. Most modifications will happen on the last page.
So the question is whether these changes can be harmful as LastActivityDate isn't the best candidate for clustering index key:
- it's not unique - although one could argue about this, especially if we'd change
datetimetodatetime2and use higher precision functionsysdatetime() - it's narrow - pretty much
- it's not static - but I've explained how it changes
- it's ever increasing
Aucun commentaire:
Enregistrer un commentaire