Thursday, January 12, 2012

Multi-tier application + database deadlock or why databases aren't queues (part 1)

Databases aren't queues.

And despite the ubiquitous presence of queuing technology out there (ActiveMQ, MSMQ, MSSQL Service Broker, Oracle Advanced Queuing) there are plenty of times when we ask our relational brethren to pretend to be queues.  This is the story of one such folly, and along the way, we'll delve into some interesting sub-plots of deadlocks, lock escalation, execution plans, and covering indexes, oh my!  Hopefully we'll laugh, we'll cry, and get the bad guy in the end (turned out I was the bad guy).

This is part one of a multi-part series describing the whole saga. In this part, I lay out the problem, the initial symptom, and the tools and commands I used to figure out what was going wrong.

And so it starts...

I'm going to set the stage for our discussion, introduce you to the problem, and establish the characters involved in our tragedy. Let's say that this system organizes music CDs into labeled buckets. A CD can only be in one bucket at a time, and the bucket tracks at an aggregate level how many CDs are contained within it (i.e. the bucket "size"). You can visualize having a stack of CDs and two buckets: "good CDs" and "bad CDs". Every once in a while you decide that you don't like your bucket choices, and you want to redistribute the CDs into new buckets--perhaps by decade: "1980s music", "1990s music", "all other (inferior) music". Later you might change your mind again and come up with a new way to organize your CDs, etc. We will call each "set" of buckets a "generation". So at generation zero you had 2 buckets, "good CDs" and "bad CDs", at generation one you had "1980s music", etc., and so on. The generation always increases over time as you redistribute your CDs from a previous generation's buckets to the next generation's buckets.

Lastly, while I might have my music collection organized in some bucket scheme, perhaps my friend Jerry has his own collection and his own bucket scheme. So entire sets of buckets over generations can be grouped into music collections. Collections are completely independent: Jerry and I don't share CDs nor do we share buckets.  We just happen to be able to use the same system to manage our music collections.

So we have:
  • CDs -- the things which are contained within buckets, that we redistribute for every new generation
  • Buckets -- the organizational units, grouped by generation, which contain CDs. Each bucket has a sticky note on it with the number of CDs currently in the bucket.
  • Generation -- the set of buckets at a particular point in time.
  • Collection -- independent set of CDs and buckets
Even while we're redistributing the CDs from one generation's buckets to the next, a CD is only in one bucket at a time. Visualize physically moving the CD from "good CDs" (generation 0) to "1980s music" (generation 1).

NOTE: Our actual system has nothing to do with CDs and buckets-- I just found it easier to map the system into this easy to visualize metaphor.

In this system we have millions of CDs, thousands of buckets, and lots of CPUs moving CDs from buckets in one generation to the next (parallel, but not distributed). The size of each bucket must be consistent at any point in time.

So assume the database model looks something like:
  • Buckets Table
    • bucketId (e.g. 1,2,3,4) - PRIMARY KEY CLUSTERED
    • name (e.g. 80s music, 90s music)
    • generation (e.g. 0, 1, 2)
    • size (e.g. 4323, 122)
    • collectionId (e.g. "Steves Music Collection") - NON-CLUSTERED INDEX
  • Cds Table
    • cdId (e.g. 1,2,3,4) - PRIMARY KEY CLUSTERED
    • name (e.g. "Modest Mouse - Moon and Antarctica", "Interpol - Antics")
    • bucketId (e.g. 1,2, etc. foreign key to the Bucket table) - NON-CLUSTERED INDEX
Note that both tables are clustered by their primary keys-- this means that the actual record data itself is stored in the leaf nodes of the primary index.  I.e. the table itself is an index.  In addition, Buckets can be looked up by "music collection" without scanning (see the secondary, non-clustered index on collectionId), and Cds can be looked up by bucketId without scanning (see the secondary, non-clustered index on Cds.bucketId).
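The consistency requirement above (each bucket's stored size must equal the number of CDs that point at it) can be sketched in a few lines of Python. This is only an illustration: the names Bucket, Cd, and size_mismatches are my own assumptions that mirror the columns listed above, not code from the real system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Bucket:
    bucket_id: int
    name: str
    generation: int
    size: int            # the "sticky note" count
    collection_id: str


@dataclass
class Cd:
    cd_id: int
    name: str
    bucket_id: int       # foreign key to Bucket.bucket_id


def size_mismatches(buckets, cds):
    """Return {bucket_id: (stored_size, actual_count)} for inconsistent buckets."""
    actual = Counter(cd.bucket_id for cd in cds)
    return {
        b.bucket_id: (b.size, actual[b.bucket_id])
        for b in buckets
        if b.size != actual[b.bucket_id]
    }
```

An empty result means every sticky note agrees with the CDs actually in the bucket; moving a CD without adjusting both sizes shows up as two mismatches.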

The algorithm

So I wrote the redistribution process with a few design goals: (1) It needed to work online: I could concurrently add new CDs into the current generation's buckets while redistributing. (2) I could always locate a CD, i.e. I could never falsely report that some CD was missing just because I happened to search during a redistribution phase. (3) If we interrupted the redistribution process, we could resume it later. (4) It needed to be parallel. I wanted to accomplish (1) and (2) with bounded blocking time, so whatever blocking work I needed to do, I wanted it to be as short as possible to increase concurrency.

I used a simple concurrency abstraction that hosted a pool of workers who shared a supplier of work. The supplier of work would keep giving "chunks" of items to move from one bucket to another. We only redistribute a single music collection at a time. The supplier was shared by all of the workers, but it was synchronized for safe multi-threaded access.


The algorithm for each worker is like:
(I)   Get next chunk of work
(II)  For each CD decide the new generation bucket in which it belongs
        (accumulating the size deltas for old buckets and new buckets)
(III) Begin database transaction
(IV)   Flush accumulated size deltas for buckets
(V)    Flush foreign key updates for CDs to put them in new buckets
(VI)  Commit database transaction


Each worker would be given a chunk of CDs for the current music collection that was being redistributed (I). The worker would do some work to decide which bucket in the new generation should get the CD (II). The worker would accumulate the deltas for counts: decrementing from the original bucket and incrementing the count for the new bucket. Then the worker would flush (IV) these deltas in a relative update like UPDATE Buckets SET size = size + 123 WHERE bucketId = 1. After the size updates were flushed, it would then flush (V) all of the individual updates to the foreign key fields to refer to the new generation's buckets like UPDATE Cds SET bucketId = 123 WHERE bucketId = 101. These two operations happen in the same database transaction.
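Steps (II) through (VI) can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for MSSQL and Hibernate; the function names, the demo data, and the choice to key each CD update by cdId are my assumptions, not the real implementation.

```python
import sqlite3
from collections import Counter


def make_demo_db():
    """Build a tiny in-memory database shaped like the model above (demo data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Buckets (bucketId INTEGER PRIMARY KEY, name TEXT,
                              generation INTEGER, size INTEGER, collectionId TEXT);
        CREATE TABLE Cds (cdId INTEGER PRIMARY KEY, name TEXT, bucketId INTEGER);
        CREATE INDEX ix_buckets_collection ON Buckets (collectionId);
        CREATE INDEX ix_cds_bucket ON Cds (bucketId);
        INSERT INTO Buckets VALUES (1, 'good CDs', 0, 2, 'demo'), (2, 'bad CDs', 0, 1, 'demo'),
                                   (10, '1980s music', 1, 0, 'demo'), (11, '1990s music', 1, 0, 'demo');
        INSERT INTO Cds VALUES (1, 'cd one', 1), (2, 'cd two', 1), (3, 'cd three', 2);
    """)
    return conn


def redistribute_chunk(conn, chunk, choose_bucket):
    """chunk: list of (cd_id, old_bucket_id); choose_bucket: cd_id -> new bucket id."""
    deltas = Counter()
    moves = []
    for cd_id, old_bucket in chunk:                      # step II
        new_bucket = choose_bucket(cd_id)
        deltas[old_bucket] -= 1
        deltas[new_bucket] += 1
        moves.append((new_bucket, cd_id))
    with conn:                                           # steps III and VI: commit, or roll back on error
        conn.executemany(                                # step IV: flush size deltas
            "UPDATE Buckets SET size = size + ? WHERE bucketId = ?",
            [(delta, bucket_id) for bucket_id, delta in deltas.items()],
        )
        conn.executemany(                                # step V: flush foreign key updates
            "UPDATE Cds SET bucketId = ? WHERE cdId = ?", moves
        )
    return dict(deltas)
```

Because both flushes share one transaction, a reader never sees a CD in its new bucket while the sizes still describe the old arrangement.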

The supplier that gives work to the workers is a typical "queue"-like SELECT query: we want to iterate over all of the items in the music collection in the old generation. This happens in a separate connection and a separate database transaction from the workers (discussed later). The next chunk will be read using the worker thread (with thread-safe synchronization). This separate "reader" connection doesn't have its own thread or anything.
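The shared supplier described above can be sketched like this. It is a simplified stand-in: ChunkSupplier and drain are hypothetical names, and a plain Python iterable plays the role of the streaming result set.

```python
import threading
from itertools import islice


class ChunkSupplier:
    """Hands out chunks from one shared iterator; the lock makes it safe for many workers."""

    def __init__(self, rows, chunk_size):
        self._rows = iter(rows)
        self._chunk_size = chunk_size
        self._lock = threading.Lock()

    def next_chunk(self):
        # The calling worker thread does the read itself; there is no reader thread.
        with self._lock:
            return list(islice(self._rows, self._chunk_size))


def drain(supplier, n_workers=4):
    """Run n_workers threads that pull chunks until the supplier is empty."""
    seen = []
    seen_lock = threading.Lock()

    def worker():
        while True:
            chunk = supplier.next_chunk()
            if not chunk:
                return
            with seen_lock:
                seen.extend(chunk)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Note what this design implies: rows are only pulled from the underlying result set when some worker asks for the next chunk, which matters later in the story.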

First sign of trouble - complete stand still

So we were doing some large-volume testing on not-so-fast hardware, when suddenly... the system just came to a halt. We seemed to be in the middle of moving CDs to new buckets, and the workers just stopped making progress.

Finding out what the Java application was doing

So the first step was to see what the Java application was doing:

c:\>jps
1234 BucketRedistributionMain
3456 jps
c:\>jstack 1234 > threaddump.out


jps finds the Java process ID, and jstack then outputs a stack trace for each of the threads in the Java program. Both jps and jstack are included in the JDK.

The resulting stack trace showed that all workers were waiting in socketRead to complete the database update to flush the bucket size updates (step IV above).

Here is the partial stack trace for one of the workers (some uninteresting frames omitted for brevity):

"container-lowpool-3" daemon prio=2 tid=0x0000000007b0e000 nid=0x4fb0 runnable [0x000000000950e000]
   java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(Unknown Source)
 ...
 at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(Unknown)
 at net.sourceforge.jtds.jdbc.SharedSocket.getNetPacket(Unknown)
 ...
 at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(Unknown)
 ...
 at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1216)
 ...
 at com.mycompany.BucketRedistributor$Worker.updateBucketSizeDeltas()
 ...
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 ...
 at java.lang.Thread.run(Unknown Source)

As you can see, our worker was updating the bucket sizes, which resulted in a Hibernate "flush" (which actually pushes the database query across the wire to the database engine), and then we await the response packet from MSSQL once the statement is complete. Note that we are using the jTDS MSSQL driver (as evidenced by net.sourceforge.jtds in the stack trace).

So the next question is why is the database just hanging out doing nothing?

Finding out what the database was doing...

MSSQL provides a lot of simple ways to get insight into what the database is doing. First let's see the state of all of the connections. Open SQL Server Management Studio (SSMS), click New Query, and type exec sp_who2

This will return output that looks like:

[Screenshot: sp_who2 output]

You can see all of the spids for sessions to the database. There should be five that we are interested in: one for the "work queue" query and four for the workers to perform updates. The sp_who2 output includes a BlkBy column, which shows the spid that is blocking the given spid in the case that the given spid is SUSPENDED.

We can see that spid 56 is the "work queue" SELECT query (highlighted red). Notice that no one is blocking it... then we see spids 53, 54, 60, and 61 (highlighted in yellow) that are all waiting on 56 (or each other). Disregard 58 - its application source is the management studio as you can see.
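When sessions wait "on 56 (or each other)", the useful question is who sits at the head of each blocking chain. Here is a small sketch of that walk over (spid, BlkBy) pairs. The head_blockers function is hypothetical, and the sample mapping in the usage note is invented for illustration; it is not the exact BlkBy values from our output.

```python
def head_blockers(blocked_by):
    """blocked_by maps spid -> blocking spid (or None when not blocked).

    Returns {head blocker spid: sorted list of spids waiting behind it}.
    """
    def root_of(spid):
        seen = set()
        # Follow the BlkBy chain upward; the seen set guards against cycles.
        while blocked_by.get(spid) is not None and spid not in seen:
            seen.add(spid)
            spid = blocked_by[spid]
        return spid

    chains = {}
    for spid, blocker in blocked_by.items():
        if blocker is not None:
            chains.setdefault(root_of(spid), []).append(spid)
    return {root: sorted(waiters) for root, waiters in chains.items()}
```

For example, head_blockers({56: None, 53: 56, 54: 53, 60: 56, 61: 60, 58: None}) reports a single head blocker, 56, with 53, 54, 60, and 61 queued behind it.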

How curious! The reader query is blocking all of the update workers and preventing them from pushing their size updates. The reader "work queue" query looks like:

SELECT c.cdId, c.bucketId FROM Buckets b INNER JOIN Cds c ON b.bucketId = c.bucketId WHERE b.collectionId = 'Steves Collection' and b.generation = 23

Investigating blocking and locking problems...

I see that spid 56 is blocking everyone else. So what locks is 56 holding? In a new query window, I ran exec sp_lock 56 and exec sp_lock 53 to see which locks each was holding and who was waiting on what.


You can see that 56 was holding a S (shared) lock on a key resource (key = row lock on an index) of object 1349579846, which corresponds to the Buckets table.

I wanted to see the engine's execution plan for the reader "work queue" query. To get this, I executed a query that I created a while ago to dump as many details about current sessions in the system as possible-- think of it as a "super who2":

select es.session_id, es.host_name, es.status as session_status, sr.blocking_session_id as req_blocked_by,
datediff(ss, es.last_request_start_time, getdate()) as last_req_submit_secs,
st.transaction_id as current_xaction_id, 
datediff(ss, dt.transaction_begin_time, getdate()) as xaction_start_secs,
case dt.transaction_type 
 when 1 then 'read_write' 
 when 2 then 'read_only'
 when 3 then 'system'
 when 4 then 'distributed'
 else 'unknown'
end as trx_type,
sr.status as current_req_status, 
sr.wait_type as current_req_wait, 
sr.wait_time as current_req_wait_time, sr.last_wait_type as current_req_last_wait, 
sr.wait_resource as current_req_wait_rsc, 
es.cpu_time as session_cpu, es.reads as session_reads, es.writes as session_writes, 
es.logical_reads as session_logical_reads, es.memory_usage as session_mem_usage,
es.last_request_start_time, es.last_request_end_time, es.transaction_isolation_level,
sc.text as last_cnx_sql, sr.text as current_sql, sr.query_plan as current_plan
from sys.dm_exec_sessions es
left outer join sys.dm_tran_session_transactions st on es.session_id = st.session_id
left outer join sys.dm_tran_active_transactions dt on st.transaction_id = dt.transaction_id
left outer join 
 (select srr.session_id, srr.start_time, srr.status, srr.blocking_session_id, 
 srr.wait_type, srr.wait_time, srr.last_wait_type, srr.wait_resource, stt.text, qp.query_plan
 from sys.dm_exec_requests srr
 cross apply sys.dm_exec_sql_text(srr.sql_handle) as stt
 cross apply sys.dm_exec_query_plan(srr.plan_handle) as qp) as sr on es.session_id = sr.session_id

left outer join 
 (select scc.session_id, sct.text
 from sys.dm_exec_connections scc
 cross apply sys.dm_exec_sql_text(scc.most_recent_sql_handle) as sct) as sc on sc.session_id = es.session_id

where 
es.session_id >= 50

In the above output, the last column is the XML execution plan. Viewing that for spid 56, I confirmed my suspicion: the plan to serve the "work queue" query was to seek the collectionId ("music collection") index on the Buckets table for 'Steves Collection', then seek into the clustered index to confirm 'generation = 23', then seek into the bucketId index on the Cds table. So to serve the WHERE clause in the "work queue" query, the engine had to use both the non-clustered index on Buckets and the clustered index (for the generation predicate).

When joining and reading rows at READ COMMITTED isolation level, the engine will acquire locks as it traverses from index to index in order to ensure consistent reads. Thus, to read the value of the generation in the Buckets table, it must acquire a shared lock. And it has!

The problem comes when the competing sessions try to update the size on that same record of the Buckets table. Each needs an X (exclusive) lock on that row (highlighted in red), and eek! It can't get it, because the reader query has a conflicting S lock already granted (highlighted in green).

Ok so that all makes sense, but why is the S lock being held? At READ COMMITTED you usually only hold the locks while the record is being read (there are exceptions and we'll get to that in Part 2). They are released as soon as the value is read. So if you read 10 rows in a single statement execution, the engine will: acquire lock on row 1, read row 1, release lock on row 1, acquire lock on row 2, read row 2, release lock on row 2, acquire lock on row 3, etc. And none of the four workers are currently reading: they are writing, or at least they would be if that pesky reader connection weren't blocking them.

To find out, I looked at why the reader query was in a SUSPENDED state (see the original sp_who2 output above). In the "super who2" output, the current_req_wait value for the "work queue" read query is ASYNC_NETWORK_IO.

ASYNC_NETWORK_IO wait and how databases return results

ASYNC_NETWORK_IO is an interesting wait. Let's discuss how remote applications execute and consume SELECT queries from databases.

[Diagram: how a SELECT result set flows from the database to the application]

That diagram is over-simplified, but within the database there are two chunks of memory to discuss: the buffer cache and the connection network buffer. The buffer cache is a shared chunk of memory, where the actual pages of the tables and indexes are kept to serve queries. So parts of the Buckets and Cds tables will be in memory while this "work queue" query executes. As the execution engine executes the plan, it works out of the buffer cache, acquiring locks and producing output records to send to the client. As it prepares these output records, it puts them in a connection-specific network buffer. When the application reads records from the result set, it's actually being served from the network buffer in the database. The application driver typically has its own buffer as well.

When you just execute a simple SQL SELECT query and don't explicitly declare a database cursor, MSSQL gives you what it calls the "default result set" -- which is still a cursor of sorts -- you can think of it as a cursor over a bunch of records that you can only iterate over once in the forward direction. As your application threads iterate over the result set, the driver requests more chunks of rows from the database on your behalf, which in turn depletes the network buffer.

However, with very large result sets, the entire results cannot fit in the connection's network buffer. If the application doesn't read them fast enough, then eventually the network buffer fills up, and the execution engine must stop producing new result records to send to the client application. When this happens the spid must be suspended, and it is suspended with the wait type ASYNC_NETWORK_IO. It's a slightly misleading wait name, because it makes you think there might be a network performance problem, but it's more often an application design or performance problem. Note that when the spid is suspended -- just like any other suspension -- the currently held locks will remain held until the spid is resumed.

In our case, we know that we have millions of CDs and we can't fit them all in application memory at one time. We, by design, want to take advantage of the fact that we can stream results from the database and work on them in chunks. Unfortunately, if we happen to be holding a conflicting lock (S lock on Bucket record) when the reader query is suspended, then we create a multi-layer application deadlock, as we observed, and the whole system screeches to a halt.
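The stall described above can be reproduced in miniature. The sketch below is a toy model, not how MSSQL is implemented: a bounded queue.Queue stands in for the connection's network buffer, a plain mutex stands in for the S lock on the Buckets row, and the function name and parameters are my own. The "worker" already has its chunk and reads nothing more while it tries to update the bucket row.

```python
import queue
import threading
import time


def worker_can_update(total_rows, buffer_rows, lock_timeout=0.2):
    """Return True if the worker gets the bucket-row lock, False if the reader starves it."""
    network_buffer = queue.Queue(maxsize=buffer_rows)   # connection network buffer
    bucket_row_lock = threading.Lock()                  # lock on the shared Buckets row
    suspended = threading.Event()                       # reader hit a full buffer
    stop = threading.Event()                            # ends the simulation

    def reader_session():
        for row in range(total_rows):
            with bucket_row_lock:                       # lock held while producing this row
                while not stop.is_set():
                    try:
                        network_buffer.put(row, timeout=0.01)
                        break
                    except queue.Full:
                        # Buffer is full and nobody is reading: the session is
                        # "suspended" but still holds the row lock.
                        suspended.set()

    reader = threading.Thread(target=reader_session, daemon=True)
    reader.start()
    # Wait until the reader has either finished or been suspended.
    while reader.is_alive() and not suspended.is_set():
        time.sleep(0.001)

    got_lock = bucket_row_lock.acquire(timeout=lock_timeout)   # the worker's UPDATE
    if got_lock:
        bucket_row_lock.release()
    stop.set()
    reader.join()
    return got_lock
```

When the whole result fits in the buffer, the reader finishes and the worker gets its lock. When it does not fit, the reader parks while holding the lock, the worker waits on the reader, and the reader waits on the worker to read more rows.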

So what to do for a solution? I will discuss some options and our eventual decision in Parts 2 and 3. Note that I gave one hint at our first attempt when I talked about "covering indexes", and then there is another hint above that we didn't get to in this post about "lock escalation".

Steve

Thursday, December 22, 2011

MSSQL non-clustered indexes INCLUDE feature explained

Today I received a question from someone about the nature of the INCLUDE feature when creating a non-clustered (secondary) index on a table. My response was a bit long, and I haven't posted in a while -- ergo blogpost! Here's the question:

What is your opinion about using the include statement when building your indexes? I’ve never really used the included functionality and I’m curious if there are upsides or downsides to them. My reading lead me to believe that I should use the include when I have a non clustered index and space appears to be a concern. I can use the ‘include’ as part of the index with the columns that may not be used as frequently. We’re using standard datatypes and nothing with very large column widths. So is there a benefit to using include?



A clustered index is stored in a B+-tree data structure, and the actual data (the whole row) is in the leaf. So if you think of a b-tree structure (simplified) as something like:
           A
         /   \
        B     C
       / \   / \
      M   N O   P
And you have records like
[ Id =1, FirstName = Steve, LastName = Ash ]
[ Id = 2, FirstName = Neil, LastName=Gafter ]
Then for the clustered index, the id data (id=1 and id=2) will exist at every node (A, B, C, M, N, O, P), but the bytes for “steve” “ash” “neil” “gafter” will only exist in leaf nodes (either M N O or P). The id is used to locate which leaf holds the whole record. (note this is a simplification, see Wikipedia for more info about b+-trees). The noteworthy fact is being a clustered index means that the whole record is in a leaf (i.e. M N O P).

Now let’s think of a non-clustered index on lastName that corresponds to the clustered index above.
           D
         /   \
        E     F
       / \   
      Q   R         
In this case the “last name” is used to get to the leaf, and the leaf holds the primary key of the corresponding row in the clustered index. So D E or F will have things like lastName=Ash and lastName=Gafter and the leaves Q and R will have both lastnames and IDs. So an entry in Q or R might look like [lastname=Ash, id=1] (again simplifying).

So if you issue a query like
SELECT firstName FROM ThisTable WHERE lastName = 'Ash'
(and the optimizer chooses to use the non-clustered index) then the database engine will do something like:
  1. Seek the non-clustered index for ‘Ash’
    1. Look in D to decide which direction to go E or F (lets say E is the right choice)
    2. Look in E to decide which direction to go Q or R (lets say Q is the right choice)
  2. Find the primary key for ‘Ash’
    1. In Q find id for Ash – which is id=1
  3. Seek the clustered index for id = 1
    1. Look in A to decide which direction to go B or C (lets say B is right)
    2. Look in B to decide which direction to go M or N (lets say N is right)
  4. Find the value of firstName for id=1
    1. In N find firstName for id=1 which is ‘Steve’
  5. return ‘Steve’ as the query result
Notice that we created a non-clustered index on lastName and we can use that index to quickly locate things by last name, but if we need any additional info, then we have to go back to the clustered index to get the other info in the SELECT list.

The “include” provides a way for you to shove additional information in the leaf nodes of non-clustered indexes (Q and R) to alleviate this "going back" to the clustered index.

So had I created the non-clustered index above on LastName INCLUDE FirstName then the engine would only need to do:

  1. Seek the non-clustered index for ‘Ash’
    1. Look in D to decide which direction to go E or F (lets say E is the right choice)
    2. Look in E to decide which direction to go Q or R (lets say Q is the right choice)
  2. Find the firstName for ‘Ash’
    1. In Q due to the include there is id=1 AND firstName=’Steve’ so we have the first name right here!
  3. return Steve
So you get rid of that entire other seek into the clustered index. This additional seek shows up as a “bookmark lookup” operation in MSSQL 2000 query plans, as a “key lookup” in MSSQL 2008, and as just another join in MSSQL 2005 query plans. A bookmark lookup is a join -- just with a different name to indicate its semantic role in the query plan.

When you have an index such that the query can be completely served from the index without needing to go back to the clustered index – such an index is called a covering index. The index covers the needs of the query completely. And it’s a performance boost as you don’t need the other join.
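The difference between the two lookup paths above can be shown with a toy model that counts seeks. Plain dictionaries stand in for the B+-trees here, so this says nothing about real page layout; the Table class and its method names are assumptions made up for the illustration.

```python
class Table:
    """Toy table: a 'clustered index' keyed by id plus a 'non-clustered index' on lastName."""

    def __init__(self, rows, include=()):
        # Clustered index: the key leads to the whole row.
        self.clustered = {row["id"]: row for row in rows}
        # Non-clustered index: leaf entries hold the id plus any INCLUDEd columns.
        self.by_last_name = {}
        for row in rows:
            entry = {"id": row["id"], **{col: row[col] for col in include}}
            self.by_last_name.setdefault(row["lastName"], []).append(entry)
        self.seeks = 0

    def select_by_last_name(self, column, last_name):
        self.seeks += 1                              # seek the non-clustered index
        results = []
        for entry in self.by_last_name.get(last_name, []):
            if column in entry:                      # covered: answer is in the leaf
                results.append(entry[column])
            else:                                    # not covered: go back to the clustered index
                self.seeks += 1
                results.append(self.clustered[entry["id"]][column])
        return results


ROWS = [
    {"id": 1, "firstName": "Steve", "lastName": "Ash"},
    {"id": 2, "firstName": "Neil", "lastName": "Gafter"},
]

plain = Table(ROWS)
covering = Table(ROWS, include=("firstName",))
```

Running select_by_last_name("firstName", "Ash") on both tables returns the same answer, but plain needs two seeks while covering needs one.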

So this means a few things:
  1. You can obviously only “include” fields in non-clustered indexes. Clustered indexes already have all the fields in the leaf...so it doesn’t mean anything to include more.
  2. You can only have one clustered index for a table, but you can simulate having multiple clustered indexes by INCLUDing the rest of the columns on your secondary indexes
  3. By duplicating the data in the secondary index, if you UPDATE firstName – you now have to update both the clustered index and the nonclustered index. (Main trade-off consideration)
    • This is also a huge deadlock opportunity if you’re not using read committed snapshot isolation (RCSI) level. Think of two queries: (1)
      UPDATE MyTable SET firstName = 'Steve2' where id = 1
      and (2)
      SELECT shoeSize FROM MyTable WHERE lastName = 'Ash'
      (pretend shoe size is a new field that is in the clustered index but NOT in the non-clustered, i.e. NOT in the INCLUDE). Then the SELECT will seek the non-clustered index, grab shared (S) locks, then (while holding S locks) traverse the clustered index. Whereas (in the opposite order) the UPDATE will seek the clustered index, hold an exclusive (X) lock, then seek the non-clustered index to update firstName. The fact that these two queries are holding incompatible locks in opposite directions, is a deadlock waiting to happen. If I had a dollar for every time I diagnosed this deadlock scenario...
  4. By duplicating the data in the secondary index, each page in the leaf (Q and R) now has fewer rows per page and thus there is a greater memory demand on the buffer cache (and more IO to get the same number of records) (Second trade off consideration)
  5. Some databases (even MSSQL < 2005) don’t support this feature, but you can approximate it by creating non-clustered indexes with compound keys. I.e. if I created the non-clustered index on
    (lastName, firstName)
    then the index still covers queries like
    SELECT firstName FROM myTable WHERE lastName = 'Ash'
    Note that this is not as good as the INCLUDE solution (for this particular query) as now the bytes of firstName=’Steve’ take up some space in non-leaf nodes D E F.
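The opposite-order locking described in item 3 can be sketched with two threads and two mutexes. This is a deliberately crude model: real S and X locks have richer compatibility rules, and a real engine would detect the cycle and kill a victim, whereas here a timeout stands in for deadlock detection. All names are hypothetical.

```python
import threading


def run_opposite_order(timeout=0.2):
    """Two sessions take the same two locks in opposite order; report who got the second lock."""
    nonclustered_entry = threading.Lock()
    clustered_row = threading.Lock()
    both_hold_first = threading.Barrier(2)     # make sure each session holds its first lock
    both_tried_second = threading.Barrier(2)   # keep first locks held until both have tried
    got_second = {}

    def session(name, first, second):
        with first:
            both_hold_first.wait()
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            both_tried_second.wait()
            if acquired:
                second.release()

    # SELECT: non-clustered index first, then the clustered index.
    select = threading.Thread(
        target=session, args=("select", nonclustered_entry, clustered_row))
    # UPDATE: clustered index first, then the non-clustered index.
    update = threading.Thread(
        target=session, args=("update", clustered_row, nonclustered_entry))
    select.start()
    update.start()
    select.join()
    update.join()
    return got_second
```

Neither session obtains its second lock, because each is holding exactly what the other needs.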
So deciding to use an INCLUDE is (like everything) a trade off. If you have a performance critical query that is being executed frequently, then you can usually use INCLUDEs to reduce the number of joins and increase performance. In an environment where CPU is more precious than memory this can be a big win (we can talk about cache locality benefits of includes later). However, if you INCLUDE a column that will be UPDATEd later – then you often are shooting yourself in the foot as the cost can easily outweigh the benefit.

Last tidbit about INCLUDEs that I’ll mention. The sql index analyzer is REALLY aggressive about recommending you add indexes with lots of INCLUDEs. This is because the index analyzer usually doesn’t know how many UPDATEs you’re doing. Usually you just tell it what SELECT you want to speed up and it naively says “oh well of course if you add these three covering indexes, this SELECT will be faster.” And while that’s true it doesn’t take into account the _total_ workload (UPDATEs DELETEs etc) so just be skeptical when you see this if you use the index analyzer.

I can’t give you a hard and fast rule to say INCLUDE is GOOD or EVIL as – like everything with database performance tuning – it depends ;)

Steve

Sunday, July 31, 2011

Delegates or interfaces? Functional and OO Dualism

I have a mixed background: doing C#/.NET for ~4 years then switching to Java (switched jobs). I have been in the Java enterprise ecosystem for the last 4 years. I do mostly Java, but enjoy doing a little C# every now and again. C# is really a nice language. Shame it's in such a horrible Microsoft-centric ecosystem.

In any case, I've been writing a little thing in C# and needed a type with a single method to "doWork". So coming from a Java bias I created a:
public interface IWorker {
   void DoWork();
}
Later, however, I wanted to offer the users an API option of just using a lambda to "doWork". Unfortunately, there is no type conversion from a delegate to the matching "single method interface" (at least that I could find, if someone knows the answer, please share!). So as a shim, I created:
public delegate void WorkerDelegate();

public class WorkerWrapper : IWorker {
   readonly WorkerDelegate workerDelegate;
   public WorkerWrapper(WorkerDelegate workerDelegate) {
      this.workerDelegate = workerDelegate;
   }

   public void DoWork() {
      workerDelegate();
   }
}
So I wrap the lambda in a little wrapper and don't need to change my entire IWorker-based infrastructure. It works, but I'm not happy with this. I know that in the Java lambda mailing list, they are planning to include a "lambda conversion" so that lambdas can be converted to compatible single abstract method (SAM) types. This would've alleviated my need for the shim above as I could've assigned the lambda directly to an IWorker and all would've been well.

I believe the "C# way" would've been for me to use delegates all the way through to begin with. Had someone come along with an IWorker interface then that would've been assignable to my WorkerDelegate.
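The same shim, and the reverse conversion, can be sketched in Python, where the "interface" is a structural Protocol and the "delegate" is any zero-argument callable. The names Worker, WorkerWrapper, and run_all are illustrative assumptions that mirror the C# above, not a claim about how either language's standard library does it.

```python
from typing import Callable, Protocol


class Worker(Protocol):
    """The single-method 'interface' view of a unit of work."""

    def do_work(self) -> None: ...


class WorkerWrapper:
    """Adapts a plain callable (the 'delegate' view) to the Worker interface."""

    def __init__(self, fn: Callable[[], None]) -> None:
        self._fn = fn

    def do_work(self) -> None:
        self._fn()


def run_all(workers: list[Worker]) -> None:
    """Infrastructure written against the interface only."""
    for worker in workers:
        worker.do_work()
```

Going the other way needs no wrapper at all: the bound method some_worker.do_work is already a zero-argument callable, which is the direction that comes for free in the "delegates all the way through" design.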

But is this the right answer? Conceptually, how should I think of these? What is an IWorker in the above case? Is it really just a chunk of code that should be passed around as such? Or is it a member in the space of collaborating types that make up my system...

This is an example of the conceptual problems reconciling a Functional view of the world with an object oriented view of the world (dualism). I know that many people smarter than me have thought about these problems, and I'm hoping that I can find some good articles discussing them.

It feels like we're describing the same concept: a "chunk of code" that is defined by an interface (call it a delegate or a SAM, same thing). I don't think that I would have any dissonance if both were assignable to each other, and thus can be treated as different expressions of the same concept. If this were the case, then maybe I would view delegates as just SAM types -- so my "world view" is still object oriented, I just have an additional, concise lambda syntax to create SAMs. Actually, if this were the case, then you could probably invoke the Liskov substitution principle and call functional-OO dualism reconciled...

But something still seems amiss. There is more to the identity of an IWorker than the fact that it takes no arguments and returns nothing. I suppose the same questions are true of reconciling structural typing to static typing. Hmm.. I have a lot of reading to do.

I imagine there are a number of philosophical problems between functional and OO. This is just the one I ran across and felt dissonance with C#'s implementation. Maybe they truly are different things and should be treated as such. I hope (despite my extremely small readership) to get some links to articles on this topic.

Steve

Saturday, July 9, 2011

I <3 Robert C. Martin

I'm reading Clean Code by Uncle Bob. I have read sections of this book before when it came out and actually had the pleasure of watching Robert Martin present it at SDWest in 2008. I've decided to read the whole thing.

A while ago, I read a blog post from someone who was arguing that software wasn't a craft but a trade. I believe the author's intention was to say that we software developers should recognize that the value of the software is the business value and thus we shouldn't wax philosophic about "elegance in design" or software aesthetics as that was all wasting time trying to get to the goal. I may be misrepresenting the author's intention. I couldn't find the post to link it.

In any case, I disagreed entirely with this opinion. While I agree that business value is the motivator, the craft aspects such as aesthetics, conceptual purity, elegance, etc. all contribute to the solution and its extensibility and maintainability. Maybe we're just arguing over the definition of craft, trade, or art, but in any case I feel there is value in recognizing the challenge of good engineering for today and tomorrow. The masters do it almost effortlessly-- almost accidentally. That feels like art to me and thus should be labelled appropriately as craft.

To this point, clean code is more art than science and Mr. Martin has something to say about it that I really enjoyed:
Every system is built from a domain specific language designed by the programmers to describe their system. Functions are the verbs, classes are the nouns. This is not some throwback to the hideous old notion that the nouns and verbs in a requirements document are the first guess of the classes and functions of a system. Rather, this is a much older truth. The art of programming is, and always has been, the art of language design.

Master programmers think of systems as stories to be told rather than programs to be written. They use facilities of their chosen programming language to construct a much richer and more expressive language that can be used to tell that story. Part of that domain-specific language is the hierarchy of functions that describe all the actions that take place within that system. In an artful act of recursion those actions are written to use the very domain specific language they define to tell their own small part of the story.
So to argue that software is not art is to naively ignore the reality that language is hard and has a dramatic effect on the bottom line of your code base. How many software systems never change or never need to be understood after they are written? Such systems must not be very interesting or do anything important.

Let's recognize the art of good software engineering! It will motivate us to continue to improve if we recognize these things have a value.

Monday, June 27, 2011

Maven -_-

I have spent the last few days mucking about with POM files. Anyone that has done this understands where I'm going with this. So I'll just leave these two quotes that fit nicely:
But there has to be something fundamentally wrong with any tool that, whenever I use it, seems to have at least a 50% chance of completely fucking up my day.

-Charles Miller

The people who love Maven love the theory. The people who hate Maven hate the reality.

-Zutubi

Frustration.

EDIT: For anyone running across this: now that I am well over the "learning curve" hump, I <3 Maven. Seriously. Yes, there are some rough edges-- I really should be able to delete folders simply without having to do ant-runs, etc. But its benefit drastically outweighs its cost in our environment.

Wednesday, April 13, 2011

OO Reading List

One of my favorite parts of Pragmatic Thinking is the description of the Dreyfus model of skills acquisition. This describes the phenomenon of how people are distributed along some non-trivial "skill" (like programming), and defines what differentiates each skill level. Overall, as one moves up to higher skill levels, one's intuition about the problem space increases. Intuition is something for which we must train our brain through experience and knowledge. To that end, there are a few books that have helped me in increasing my intuition, which I would like to catalog. If you have others that are missing, please leave a comment! My appetite for the Amazon marketplace is insatiable ;-)

OO and Design Patterns

  • But Uncle Bob Essay on SOLID - This is Robert C Martin's article outlining the principles of Object Oriented Design. If SOLID doesn't ring a bell, then start with this article. Note that the now-famous acronym SOLID is not actually mentioned in the article, but Robert C Martin is still credited as the inventor.
  • Agile Software Development Principles, Patterns, and Practices - lovingly referred to as the PPP book (not to be confused with the protocol). This describes much of the why of OOD, and explains the Agile mindset.
  • Object Design - another often referenced book describing the why and various design decisions that go into object oriented design. Thinking about objects isn't hard-- category theory and abstraction are things that our brains do quite naturally (hence the appeal of this design methodology). However, the ergonomics of OOD can sometimes lead to an undeserved sense of self-confidence in one's design. It may feel like OOD, because "oh look-- there are objects there! inheritance! oh, my!", but within the context of software engineering there are many factors that make some designs good and some bad.
  • Design Patterns: Elements of Reusable Object-Oriented Software - Canonical book by the "Gang of Four". Has good descriptions of patterns and why they are useful. If you prefer something with prettier pictures, starting with the Head First Design Patterns book is nice too.
  • Essays on OO Software Engineering - Maybe not terribly well-known, but well written theoretical overview of OO, the design forces in OOD, and the motivations behind them. I had the pleasure to work with Ed.
  • Patterns of Enterprise Application Architecture - Canonical book on patterns by the man himself, Martin Fowler.
  • Refactoring - Another Fowler book. I haven't read this one cover to cover, but read many chapters. Good examples to get your head around particular refactoring patterns.
  • Domain Driven Design - Eric Evans's now-famous book on how to turn business problems into rich object models. Great read.
  • Real World Java EE Patterns - while this is a "Java" book, the principles and design trade-off analysis that Adam Bien does for each pattern are universal. Good read.

Code

  • Beautiful Code - Essays where programmers reflect on what is beauty in code. As the authors wax philosophic about their "beautiful" examples, you glean insight into their thought process. It's a nice read.
  • Implementation Patterns - Kent Beck's book about the low-level decisions we make as we actually type the code and create APIs (knowingly or unknowingly). One of my top recommendations. I really love this book.
  • Clean Code - guide to code readability, writing code at appropriate abstractions, etc. A nice companion to Implementation Patterns. I had the pleasure of meeting Robert C Martin at a conference once. I'm sure I acted appropriately star-struck.

Other "meta" and philosophic musings

  • Notes on the Synthesis of Form - this is not a computer science book, but deals with the theory of what is design, and describes a methodology for decomposition. It's a quick read, and at least the first half is worth the exploration just to get some ideas in your head about methodologies for design.
  • Pragmatic Programmer - what list would be complete without this one! Good overview of the pragmatic aspects of becoming a good programmer. From problem solving skills to tools, this is required reading. I think if you read this, you can skip or at most skim Productive Programmer.
  • Pragmatic Thinking and Learning: Refactor Your Wetware - I mentioned this at the front of the post. I enjoyed most of this book-- in particular the front half. Some of the topics towards the end were less interesting, because they were more familiar. In any case at least the first three chapters are a great read.
  • The Little Schemer - you can run through this in an afternoon. This is an introduction to recursion, via Scheme/LISP. This book is fun, because it's a simple and unique format, but takes you through a journey that helps shape your mind to "think" of solutions recursively.
  • Beautiful Architecture - This one wasn't as successful to me as Beautiful Code, but still worth a read. There are a few chapters you can skip, but a few (especially the first chapter) that are great.
  • Beautiful Data - I'm only about half way through this one, and need to spend more time with it, as I'm sure there are some gems in there I have yet to run across.

Reading TO-DOs

These are books that are on my shelf to read next (that are relevant to this list at least).
  • The Design of Design - Fred Brooks (of Mythical Man-Month fame) waxes about design philosophies and goes through a few case studies. I'm really excited to make time for this.
  • Masterminds of Programming - interviews with the creators of many of the major languages used over the last 30 years. Looks promising.
  • Thoughtworks Anthology - I am going to be selective in which essays I read, since some will be more useful to my current world than others. I will start with those that are most applicable, and then move out to the more "exotic" (relatively).

I feel like I'm missing some...

Steve

Sunday, March 6, 2011

Spring 2011 Reading List

I gave last Spring's reading list (which was more what I had read in the previous year) in this post. This spring, I am going to write what I am currently reading or hope to finish this Spring.
  • Seven Languages in Seven Weeks - this book looks interesting, and certainly the challenge of trying to meaningfully cover seven languages and language philosophies in a short book will be interesting if nothing else.
  • Programming in Scala - I have "played" with Scala, but not done more than toy programs. However, from everything I know about it, I believe I will really enjoy it. I buy into a lot of the functional vs object oriented dualism, and certainly enjoy the terse syntax. I think there are still interesting questions with regard to software development economics (i.e. how do we integrate Scala in a team of mixed skill levels, cost trade offs, etc.).
  • Thinking Forth - I'm not particularly interested in Forth, but I am interested in language design and problem solving philosophies.
  • Domain Driven Design - this was recommended by a commenter in the previous post, and I picked it up. I've made it through quite a bit of the book, and have enjoyed it so far. I had always heard this book referenced by Fowler et al, and really I should've read it years ago...
  • Introduction to Reliable Distributed Programming - it's a Springer book, 'nuff said. I've been through a few distributed computing books, and have a fairly broad knowledge of the space. I am hoping to get a more mature, theoretically rigorous view of the problems now.
  • The Art of Multiprocessor Programming - this book is great. I enjoyed the nice mix of intuitive description and theoretical rigor. It's a nice blend of theory and pragmatics. I really recommend reading this for anyone that wants in depth knowledge of parallel computing.
  • Introduction to Neural Networks for Java - this was an interesting book to skim, but I don't really want to recommend it, because most of it is explaining his neural network code base instead of the underlying concepts. In any case, it's a nice thing to skim over an afternoon.
  • Neural Networks and Learning Machines - Great textbook for all the theoretical background and mathematical proofs regarding Neural Nets and machine learning. I've had to crack open my calc books to freshen up while going through this... it's dense.
  • Fuzzy Models and Genetic Algorithms for Data Mining and Exploration - I can't recommend this book (see my Amazon review if you're curious why), but it does provide a decent vocab overview for the topics it covers.

Well that's it for the moment! This spring is a bit more theory heavy than the previous list... that probably reflects my work and school change, which has been more research oriented in the last year. I'm not moving away from the real world, just trying to get a more comprehensive grasp on both worlds in the areas in which I am interested. I've always thought this was one of my strengths-- knowing enough theory to shape my thinking skills and be knowledgeable-- but continuously, deliberately enriching my implementation and practical skills to actually put the theory to good use! The intersection of the two is the more rewarding area for me, I think.

Steve