In the first part of this series we looked at how to get Kafka Connect set up with the Cassandra Source connector from Landoop. We also took a look at some design considerations for the Cassandra tables. In this post we will examine some of the options we have for tuning the Cassandra Source connector.
Configuring when to start looking for data
The Cassandra Source connector pulls data from a Cassandra table based on a date/time column. The first property we want to configure tells the connector when it should start looking for data in the Cassandra table. This is set with the connect.cassandra.initial.offset property. Any data that exists prior to this date/time will not be published to the Kafka topic.
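For example, the property might look like the following (the timestamp value is illustrative; check the connector documentation for the exact format your version expects):

```properties
# Rows with a date/time value earlier than this offset are never published
connect.cassandra.initial.offset=2018-01-22 12:04:00.0000000Z
```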
If this property is not set then the connector will use the default value of Jan 1, 1900. This is not what you will want as it will cause long delays in publishing data to the Kafka topic. This delay is the result of the connector having to work its way through more than a century of time slices before reaching your data. How fast it does this will be determined by how some of the other properties are configured.
Once the connector has picked up data from the table and successfully published messages to Kafka, it will store the date/time of the last published row as an offset in the Kafka topic connect-offsets. Once a value has been published to that topic, the connector will always use it over the value provided by connect.cassandra.initial.offset.
Configuring how often to check for data in the table
The connector works by polling the table and looking for new rows that have been inserted since it last checked for data. How often to poll for data is managed by the connect.cassandra.import.poll.interval property. The configuration shown below will look for new data every ten seconds.
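A minimal sketch of that configuration (the value is in milliseconds):

```properties
# Poll the Cassandra table for new rows every ten seconds
connect.cassandra.import.poll.interval=10000
```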
If the connector is still processing rows from the result set of the prior polling cycle, it will not query the table for more data. That polling cycle is skipped, at least with regard to querying the Cassandra cluster. The polling cycle still matters, however, because it also determines how often data is published to the Kafka topic.
This post will look at how to set up and tune the Cassandra Source connector that is available from Landoop. The Cassandra Source connector is used to read data from a Cassandra table, writing the contents into a Kafka topic using only a configuration file. This enables data that has been saved to be easily turned into an event stream.
In our example we will be capturing data representing a pack (i.e., a large box) of items being shipped. Each pack is pushed to consumers as JSON on a Kafka topic.
The Cassandra data model and Cassandra Source connector
Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across numerous tables.
Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query for data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range, starting with the date/time stored in the offset. We will look at how to configure this later. For now we want to focus on the constraints for the table. Since Cassandra doesn’t support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.
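The query the connector generates is driven by a KCQL mapping along these lines (the topic, table, and column names here are hypothetical; consult the Landoop connector docs for the full KCQL syntax):

```properties
# Publish rows from the packs table to the pack-events topic,
# using the event_ts date/time column to drive the time range
connect.cassandra.kcql=INSERT INTO pack-events SELECT * FROM packs PK event_ts INCREMENTALMODE=TIMESTAMP
```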
Last week several of my colleagues and I were able to head up to New York City to attend the Kafka Summit. It was a blast being in the Big Apple! I wanted to share some observations about the summit and some of the sessions.
The first observation is that there is a lot of buzz around event/stream processing and the Kafka streaming platform. The hotel in the center of Manhattan was top-notch. As one entered the summit, the Kafka logo was projected on the walls and was even emblazoned on the cookies. Several companies gave presentations on what they are doing with the platform, including LinkedIn, ING, Airbnb, Uber, Yelp, Target, and the New York Times.
To get things kicked off, Jay Kreps gave the opening keynote (available here) and explored the question: what is a streaming platform? He noted the three things that are required:
the ability to publish and subscribe to streams of data
the ability to store and replicate streams of data
the ability to process streams of data
The Kafka streaming platform, comprising Apache Kafka along with the Kafka Producer and Consumer APIs, Kafka Streams, and Kafka Connect, provides all of these capabilities.
Jay Kreps explored the three lenses through which one is tempted to view Kafka: Messaging, Big Data, and/or ETL. But really the Kafka streaming platform encompasses all of these.
One of the best quotes at the summit came during the keynote of Ferd Scheepers, architect at ING. He compared the current state of streaming platforms to going through puberty.
Camel 2.19 was recently released. You can read about that on Claus Ibsen’s blog. A quick look through the release notes makes it readily apparent that numerous new features went into this version.
The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.
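The pattern described above can be sketched in a few lines of plain Java. The class and method names below are illustrative, not taken from Camel or any library:

```java
import java.util.function.Supplier;

/** Minimal circuit breaker sketch: trips open after a threshold of failures. */
class CircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int threshold;   // failures allowed before the breaker trips
    private int failures = 0;
    private State state = State.CLOSED;

    CircuitBreaker(int threshold) {
        this.threshold = threshold;
    }

    /** Wraps the protected call; fails fast once the breaker is open. */
    <T> T call(Supplier<T> protectedCall) {
        if (state == State.OPEN) {
            // Breaker has tripped: return an error without invoking the call at all
            throw new IllegalStateException("circuit open");
        }
        try {
            T result = protectedCall.get();
            failures = 0;          // a success resets the failure count
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (failures >= threshold) {
                state = State.OPEN;  // threshold reached: trip the breaker
            }
            throw e;
        }
    }

    State state() {
        return state;
    }
}
```

A production implementation would add a half-open state that periodically lets a trial call through, which is what Camel's implementations provide.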
Now if you have used Camel for any length of time you may be wondering why there is another implementation of this pattern. After all, Camel already provides two.
The first was released with 2.14 and was added by extending LoadBalancerSupport in the CircuitBreakerLoadBalancer. The documentation can be found on the Camel site under the Load Balancer EIP. The slightly modified sample DSL is from the Camel documentation.
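A sketch of what that sample looks like in Camel's XML DSL is shown below. The endpoint URIs and values are illustrative; verify the exact element and attribute names against the Load Balancer EIP page for your Camel version:

```xml
<route>
  <from uri="direct:start"/>
  <loadBalance>
    <!-- trip after 2 failures; retry (half-open) after 1000 ms -->
    <circuitBreaker threshold="2" halfOpenAfter="1000">
      <exception>java.lang.Exception</exception>
    </circuitBreaker>
    <to uri="mock:result"/>
  </loadBalance>
</route>
```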
Our team is reading through the classic Clean Code together. After reading a chapter or two we will get together to discuss the concepts over lunch. We try to keep them fun and interactive. These are the slides from our first session (link).
In discussing the question – why don’t we write clean code – we explored the following three reasons.
There is a level of subjectivity
There is a good chance that when I opened a pull request for my team to evaluate, I thought the code being pushed read like well-written prose, was understandable to others, was tested, and was thus maintainable by my colleagues. However, the code should be considered readable by the team that is responsible for owning and maintaining it, which usually means that there will be some comments and feedback.
There may not be an understanding of what Clean Code is
If writing clean code were obvious, I imagine that Bob Martin would not have written a book on it. And sites, like The Daily WTF, poking fun at various “dirty” code would not exist. Understanding what clean code looks like and the techniques to improve it must be learned. Our goal as a team is to work through Clean Code so everyone on the team will know what clean code is and why it is important.
There are several computer books that have become classics. One of these is Code Complete by Steve McConnell. The first edition was written in 1993. That goes back to when I started collecting a paycheck as a professional developer. And it precedes classics like the Gang of Four’s Design Patterns (1994), The Pragmatic Programmer: From Journeyman to Master (1999), Agile Software Development: Principles, Patterns, and Practices (2002) and Clean Code (2008). Some consider this book to be the first collection of coding practices.
In this book, McConnell stresses the importance of readable code.
He notes that writing readable code is one of the things that separates the great coders from the rest.
A great coder [level 4] who doesn’t emphasize readability is probably stuck at Level 3, but even that isn’t usually the case. In my experience, the main reason people write unreadable code is that their code is bad. They don’t say to themselves “My code is bad, so I’ll make it hard to read.” They just don’t understand their code well enough to make it readable, which puts them at Level 1 or Level 2.
Even before Code Complete, the book Structure and Interpretation of Computer Programs, written by Abelson and Sussman, was published in 1985. The full text of the book is available online (link). The preface to the first edition (link) contains the oft-repeated line:
programs must be written for people to read, and only incidentally for machines to execute.