We’re happy to announce that we’re open sourcing one component of our Kafka operations toolkit: an algorithm that computes a new partition assignment to brokers while obeying the following properties:

- Data movement is minimized
- Partitions for each topic are evenly distributed across brokers
- Multiple replicas for each partition never appear in the same rack (or […]
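The excerpt above lists the properties the new assignment must satisfy. As a rough illustration of the second and third properties (even spread per topic, no two replicas of a partition in the same rack), here is a minimal pure-Python sketch. This is not the open-sourced algorithm itself — the function names `rack_alternating_brokers` and `assign` are made up for this example, and the sketch omits the first property (minimizing data movement would require diffing against the previous assignment).

```python
def rack_alternating_brokers(brokers_by_rack):
    """Order brokers so consecutive entries come from different racks,
    similar in spirit to Kafka's rack-aware round-robin placement."""
    iterators = [iter(brokers) for brokers in brokers_by_rack.values()]
    order = []
    while iterators:
        still_going = []
        for it in iterators:
            try:
                order.append(next(it))  # take one broker from each rack per pass
                still_going.append(it)
            except StopIteration:
                pass  # this rack is exhausted
        iterators = still_going
    return order


def assign(num_partitions, replication_factor, brokers_by_rack):
    """Place each partition's replicas at consecutive positions in the
    rack-alternating order, so partitions spread evenly across brokers
    and a partition's replicas land in different racks (as long as the
    replication factor does not exceed the number of racks)."""
    order = rack_alternating_brokers(brokers_by_rack)
    return {
        p: [order[(p + r) % len(order)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }
```

With two racks of two brokers each and replication factor 2, every partition ends up with one replica per rack and leadership rotates evenly across all four brokers.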
This post is the first in a series that describes our experience in adopting Apache Kafka. Here at Sift Science, we have introduced Kafka as the messaging layer between our distributed services. In particular, we treat Kafka topics as message queues, which source services produce to and target services consume from. For example, if we […]
Adventures in React Performance Debugging
Recently I read Benchling’s two-part series on debugging performance issues in React, and it really echoed the issues and solutions I’ve been working through on the Sift Science Console. So I was inspired to chime in with some of my own React performance debugging experiences, in what may become a short series itself.
A few weeks ago, I attended my first Grace Hopper Celebration as a technical speaker. There, I presented twice, sharing some findings from my research work on human-robot interaction. I walked away from the conference having learned two new words: pipeline and retention. Don’t get me wrong – I know and understand these words individually, but I’ve never heard these two words used more frequently anywhere else in my life. Every keynote and plenary speaker talked about either the pipeline problem, the retention problem, or both. They picked sides over which is the bigger problem, or waffled between the two. Every brunch, linner, and dinner conversation revolved around these two keywords.
At Sift Science, we use a variety of popular machine learning models to detect fraud for our customers. However, until recently we relied exclusively on a combination of linear models and sophisticated feature engineering. As we were reaching the limits of this setup, we began experimenting with our first non-linear model: random decision forests. Several months and over 100 experiments later, we were thrilled to announce the addition of random decision forests to our ensemble of models used to fight fraud. Along the way we learned quite a few things about designing a random decision forest classifier for the fraud detection use case. Here we detail several of these learnings, including how we handled sparse and missing features, useful model visualization techniques, heuristics we used to improve class separation, specialized feature engineering, and how we combined our random decision forest with our existing models. All told, these learnings resulted in an 18% reduction in error for our customers.
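The excerpt mentions handling sparse and missing features in the decision forest. One widely used technique at prediction time — shown here as an illustrative sketch, not necessarily the approach Sift shipped — is to route an example down both branches when its split feature is absent and average the resulting scores. The `Node` class and `predict` function below are hypothetical names for this example.

```python
class Node:
    """Tiny decision-tree node: internal nodes split on (feature, threshold);
    leaves carry a score in [0, 1]."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, score=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.score = score


def predict(node, example):
    """Score a sparse example (a dict of feature -> value). When the
    split feature is missing, follow both branches and average their
    scores, one common way to handle missing values at prediction time."""
    if node.score is not None:  # leaf
        return node.score
    value = example.get(node.feature)
    if value is None:  # feature missing: blend both subtrees
        return 0.5 * (predict(node.left, example) + predict(node.right, example))
    branch = node.left if value <= node.threshold else node.right
    return predict(branch, example)
```

On a one-split tree, a present feature follows the usual branch, while an example missing that feature gets the midpoint of the two leaf scores rather than being forced down an arbitrary side.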
We really love tech talks.
At Sift Science, sharing knowledge and facilitating great discussion are two of our favorite things (just behind fraud-fighting, board games, ML, and really beautiful data visualization). In that vein, we’ve been delighted to host a summer tech talk series entitled Turn Up The Bayes, where we invite awesome engineers to chat about the interesting things that they’re working on. To set the mood, we provide delicious pizza and refreshing beverages, and set aside plenty of time for discussion, questions, and more pizza.
On Thursday, July 30, at 10:10 PM PST, the Sift Science API became unavailable. Service was fully restored at 11:37 PM, and the backlog of data had been cleared by 1:10 AM on July 31. As a result of this outage, events sent to the Sift API were rejected and we were unable to provide scores. We know that our users rely on Sift Science to be always available, and last night we did not live up to that expectation. We’d like to explain what happened during this outage and what we’re doing to prevent outages in the future.
Here at Sift Science, we just completed another big step in our ongoing marketing site redesign, overhauling the homepage and replacing old landing pages with prettier, responsive, and more performant ones. While the big performance improvements aren’t quite ready to showcase yet (check back soon for more on that), I realized that there are a few custom Sass mixins and placeholders I rely on heavily for responsive development (I’m not actually sure what I’d do without them), and I thought I’d share them here, along with some CodePens, so that other people can take advantage of them too!
We’re adding random decision forests to our machine learning solution, so get ready for an 18% improvement in Sift Score accuracy!
This week, we launched an entirely new machine learning model called random decision forests, which will work alongside our existing models. Why? For an additional layer of prediction power, of course. With Sift Science’s decision forests in place, we expect that, on average, our customers will see a significant increase in fraud detection accuracy. This added model makes our online and large-scale learning capabilities even more robust!
This week, we hosted the first session of our new summer speaking series (Turn Up The Bayes). I gave a talk on how we leverage a distributed database, HBase, to power an infrastructure that enables performant, distributed online learning. The following is a brief summary…but first, a quick introduction.
Fraudsters always search for new ways to exploit opportunities at the expense of companies that provide legitimate goods and services. At Sift Science, we use real-time supervised machine learning to sabotage fraudster plots. As it turns out, the “real-time” portion of our product brings significant infrastructure challenges.
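Since this post is about real-time supervised learning, here is a generic sketch of what a single online update can look like: one stochastic-gradient step of logistic regression over a sparse feature dict. This is illustrative only — the function names are made up for this example, and it is not Sift’s actual model or the HBase-backed infrastructure the talk describes.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def online_update(weights, features, label, learning_rate=0.1):
    """One stochastic-gradient step of logistic regression.

    weights and features are sparse dicts (feature name -> float);
    label is 0 or 1. Mutates and returns the weight dict, so a stream
    of labeled events can be folded in one at a time."""
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) - learning_rate * error * v
    return weights
```

Because each update touches only the features present in one event, the model can ingest labels as they arrive instead of waiting for a batch retrain — the essence of the "real-time" requirement the post refers to.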