Wednesday, March 06, 2019

From zero to prod ETL in 30 minutes with Streamsets

Now that I work for MongoDB I work with Streamsets quite a bit, but a while back I was on a different journey at a customer site:
we were struggling to pick an ETL tool. Many of the ones we looked at were very pricey and required significant admin and resource time to manage. Luckily, we found Streamsets.

Streamsets lets you set up a drag-and-drop ETL system within a few minutes. They offer a containerized version as well as a tarball you can build manually if you wish; I strongly recommend using the Docker Hub build. SDC has every origin and destination you can imagine, plus processors in the middle for things like converting CSV to JSON. There is a really cool connected car demo that exercises a lot of this functionality and should give you an idea of what Streamsets looks and feels like; we will step through building a more basic pipeline below. You can view the connected car demo HERE

Pretty cool, right? Let's build a basic pipeline using Streamsets. First step: pull down the image from Docker Hub...
(this is assuming you have Docker installed)
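
If you're not sure whether Docker is ready to go, a quick sanity check with the standard Docker CLI looks like this:

docker --version
docker info

If docker info errors out or hangs, start the Docker daemon before continuing.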


jefferys-mbp:Desktop jefferyschmitz$ docker pull streamsets/datacollector
Using default tag: latest
latest: Pulling from streamsets/datacollector 
e20c4e30543a: Pull complete 
a5f9fc83acf6: Downloading [===========>                                       ]  18.27MB/78.82MB
78a3db3b6dea: Download complete 
14c4058c4e3b: Download complete 
c4cf8bd338cf: Downloading [==>                                                ]  16.69MB/340MB
d7bb309b44cf: Download complete 
5c7a4ebae034: Download complete 
ae8b45618636: Download complete 
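
Once the pull completes, you can confirm the image is available locally:

docker images streamsets/datacollector

You should see the latest tag listed along with an image ID and size.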


Now that we have the SDC image pulled down, you should be able to fire it up by running this command:

docker run --restart on-failure -p 18630:18630 -d --name streamsets-dc streamsets/datacollector dc
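
Before heading to the browser, it's worth confirming the container actually came up, since SDC can take a little while to start. A couple of quick checks, reusing the container name from the run command above:

docker ps --filter name=streamsets-dc
docker logs -f streamsets-dc

Tail the logs until the data collector reports it is up, or poll the port with curl until it stops refusing connections:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18630/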

Now fire up your browser and cruise on over to http://localhost:18630/ and you should see a screen that looks like this:



The default user and password are admin / admin; this can be changed in the settings menu.

Once logged in you will see a pretty blank admin console. Let's build a pipeline by clicking "Create New Pipeline" in the upper left-hand corner.



So we give it a name and a description and off we go!



You will now be presented with an empty pipeline canvas, so let's create a simple one. On the right side of the UI you will see a menu with Origins, Executors, and Destinations. This is about the simplest pipeline you can create: from the Origins, grab the SFTP widget and drag it onto the canvas. Then grab the Local FS destination, drag it to the right of the SFTP origin, and connect the two. It should look like this:



In the configuration pane for both stages, make sure you fill out the proper settings for each processing stage. It's pretty straightforward stuff like the site URL, credentials, and data format; all of it must be filled out for each stage, and you will see validation errors before running the pipeline if you missed something.
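
If you don't have an SFTP server handy to point the origin at, you can spin up a throwaway one in Docker too. This is just a sketch using the community atmoz/sftp image from Docker Hub (my choice for illustration, not part of Streamsets; the demo username, password, and upload directory are likewise made up for this example):

docker run -p 2222:22 -d --name test-sftp atmoz/sftp demo:demo:::upload

In the SFTP origin you would then use demo / demo as the credentials and point the URL at the upload directory, e.g. sftp://host.docker.internal:2222/upload on Docker for Mac or Windows (on Linux, use the host's IP instead).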

Once all your connections are set up, press the Start button underneath the gears in the upper right-hand corner, and data will start flowing from the SFTP origin to the local FS.
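
When you're finished experimenting, the container stops and cleans up like any other:

docker stop streamsets-dc
docker rm streamsets-dc

Note that pipeline definitions live inside the container, so removing it discards them; if you want them to persist, check the image documentation for mounting a data volume when you first run the container.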

Streamsets also has a great team behind the scenes that is quick to respond to feature requests, with really fast turnaround times; working with them is a breath of fresh air.
Feel free to email me with any questions you may have, or contact Streamsets directly!

Happy Hacking!!!!

