Tuesday, April 16, 2013

paging s3 listObjects as a scala stream

One of Scala's fun features is its lazy list class, Stream. I guess a lazy list is not a big deal for a Haskell programmer, but it's cool for the rest of us Java, Ruby, Python, JavaScript, C#, ... programmers.

Anyway - I plan to host some web content on S3, so I've been writing some Scala code that uses S3's Java API to automate a few tasks. One of the patterns S3 employs is to "page" large result sets, and Scala streams provide a natural way to load the pages on demand - something like this:

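      // s3 is assumed to be the AWS SDK package com.amazonaws.services.s3, imported by name
      lazy val s3Listing:Stream[s3.model.ObjectListing] =
          // first page: bucket, prefix, no marker, "/" delimiter, default page size
          s3Client.listObjects(
            new s3.model.ListObjectsRequest( folder.getHost, queryFolderPath, null, "/", null )
          ) #:: s3Listing.takeWhile( _.isTruncated ).map( (part) => s3Client.listNextBatchOfObjects( part ) )
          // each later page is requested only when forced, and only while the previous page is truncated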
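
To see the on-demand behaviour without an S3 account, here is a minimal sketch of the same pattern against a made-up paged service. Page, FakeService, fetch, and nextMarker are hypothetical names invented for this illustration and are not part of the AWS SDK; the call counter only exists to show when pages are actually fetched.

```scala
// Hypothetical stand-in for an S3 ObjectListing: one page of keys plus a truncation marker.
case class Page(keys: List[String], nextMarker: Option[Int]) {
  def isTruncated: Boolean = nextMarker.isDefined
}

// Hypothetical paged service: serves `pageSize` keys per call and counts the calls.
class FakeService(data: Vector[String], pageSize: Int) {
  var calls = 0
  def fetch(marker: Int): Page = {
    calls += 1
    val end = math.min(marker + pageSize, data.length)
    Page(data.slice(marker, end).toList, if (end < data.length) Some(end) else None)
  }
}

val service = new FakeService(Vector("a", "b", "c", "d", "e"), pageSize = 2)

// Same shape as the S3 listing: the first page, then one more page per truncated predecessor.
lazy val pages: Stream[Page] =
  service.fetch(0) #:: pages.takeWhile(_.isTruncated).map(p => service.fetch(p.nextMarker.get))

val firstKeys = pages.head.keys             // forces only the first page
val callsAfterHead = service.calls          // a single fetch so far
val everyKey = pages.flatMap(_.keys).toList // forces the remaining pages
```

On Scala 2.13 and later Stream is deprecated in favour of LazyList, which supports the same #:: construction.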

Pretty cool, but unfortunately Stream has a quirk: its flatMap method can easily overflow the stack. For example, the following code for collecting the directories in a file system blows up:

          lazy val folderStream:Stream[jio.File] = new java.io.File( folder.path.getPath ).getCanonicalFile #:: 
            folderStream.flatMap( (d) => { d.listFiles().filter( (sd) => sd.isDirectory ) } )

We can still construct the stream, while avoiding flatMap, by using a recursive method that manages its own stack:

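          // jio is assumed to be an alias for java.io: import java.{ io => jio }
          def depthFirst( stack:List[jio.File] ):Stream[jio.File] = {
            stack match {
              case head :: tail => {
                  // listFiles returns null when head cannot be listed, so guard with Option
                  val subDirs = Option( head.listFiles ).getOrElse( Array.empty[jio.File] ).filter( _.isDirectory ).toList
                  // push the children on top of the stack so they are visited before siblings
                  head #:: depthFirst( subDirs ++ tail )
              }
              case Nil => Stream.empty
            }
          }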

          depthFirst( List( root ) )
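
The order in which a node's children are combined with the pending list decides the traversal order: children in front makes the list behave as a stack, children at the back makes it behave as a queue. A minimal sketch on an in-memory tree, where Dir, walk, and combine are hypothetical names used only for this illustration:

```scala
// Hypothetical in-memory tree standing in for a directory hierarchy.
case class Dir(name: String, children: List[Dir] = Nil)

// Generic lazy traversal: `combine` decides where a node's children go
// relative to the nodes that are already pending.
def walk(pending: List[Dir])(combine: (List[Dir], List[Dir]) => List[Dir]): Stream[Dir] =
  pending match {
    case head :: tail => head #:: walk(combine(head.children, tail))(combine)
    case Nil          => Stream.empty
  }

val tree = Dir("root", List(
  Dir("a", List(Dir("a1"), Dir("a2"))),
  Dir("b", List(Dir("b1")))
))

// Children in front of the pending list: it behaves as a stack (depth-first).
val depthOrder = walk(List(tree))((kids, rest) => kids ++ rest).map(_.name).toList

// Children behind the pending list: it behaves as a queue (breadth-first).
val breadthOrder = walk(List(tree))((kids, rest) => rest ++ kids).map(_.name).toList
```

Because the tail of each stream cell is only evaluated when it is forced, the recursion in walk never goes deeper than one call at a time, no matter how large the tree is.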