An Introduction to Filament
2024-10-01
These days, everybody wants to design custom hardware and they want to do it quickly. Historically, the go-to tool for that has been an HDL like SystemVerilog or VHDL. Many people (me included) are dissatisfied with this; traditional HDLs provide a very tedious and error-prone programming model. In response, a lot of very smart people have thought up interesting alternatives to traditional HDLs with the goal of making hardware design more accessible and productive (links to come). Here, I'll talk about the one I work on: Filament.
Before we get there, though, we'll work through a small example in Verilog that will highlight the specific problem that we want to solve with Filament.
An example
Let's start small -- we'll design a simple ALU that can support multiplication and addition. I'll be writing my code snippets in Verilog, and I'll explain each step so even if you've never seen an HDL before, you can still follow what we're doing.
Here's what our signature will look like. If anyone else wants to use our ALU in their design, this is what they'll see.
module alu
(
input logic [31:0] in0,
input logic [31:0] in1,
input logic opsel,
output logic [31:0] out
);
Our signature is straightforward: we have 32-bit operands and have a 1-bit control signal to select between multiplication and addition. Let's move on to the first part of the component's body.
logic [31:0] add_result;
assign add_result = in0 + in1;
We can use Verilog's +
operator and it'll synthesize just fine. Let's move on to multiplication.
logic [31:0] mult_result;
assign mult_result = in0 * in1;
This would also work fine, but let's say that we aren't happy with how our downstream tools will synthesize the *
operator.
Luckily, we have access to someone else's multiplier implementation; let's instantiate that instead.
logic [31:0] mult_result;
imul multiplier
(
.clk (),
.reset (),
.in0 (),
.in1 (),
.out ()
);
Here, imul
is the module name and multiplier
is the named instantiation of it inside our circuit. It looks like we'll need
clk
and reset
signals for it; let's add those to our module signature.
module alu
(
input logic clk,
input logic reset,
input logic [31:0] in0,
input logic [31:0] in1,
input logic opsel,
output logic [31:0] out
);
Now, we can make the port connections for our instantiation.
logic [31:0] mult_result;
imul multiplier
(
.clk (clk),
.reset (reset),
.in0 (in0),
.in1 (in1),
.out (mult_result)
);
One more step: we need to drive the output of our ALU. We'll do that with a mux to choose between the add and multiply outputs.
assign out = opsel ? add_result : mult_result;
out
to add_result
when opsel
is 1 and mult_result
when it is 0.
Looks like we're done! Time to test our ALU. I won't go through the short test bench I wrote for our ALU, but if you're curious you can view the code here.
I made a file alu_tb.v
which can be used to simulate our design with iverilog
:
% iverilog -Wall -g2012 -o alu_tb alu_tb.v
I included print statements in the test bench, so we can inspect the output:
% ./alu_tb
cycle 0: in0 = 00000002, in1 = 00000004, opsel = 1
cycle 0: out = 00000006
cycle 1: in0 = 00000002, in1 = 00000004, opsel = 0
cycle 1: out = 00000000
We start out by testing our ALU's add functionality, and 2 + 4 does indeed equal 6! We test multiplication on the next cycle, but something is off. Instead of getting 8 like we expect, we get 0. This is weird, so let's dig into the waveform that we generated from our test bench.
At cycle 0, we see that opsel
is 1, meaning we are computing the addition. In that same cycle, out
is 6. This makes sense, because addition is combinational. In the next cycle,
we set opsel
to 0. However, out
doesn't become 8 until cycle 3.
Aha! This is an important clue. Implicit in our ALU design was our assumption that the multiplier would return its output in the same cycle we sent its inputs. The waveform tells us this can't be true, but let's look at the code for the multiplier to confirm:
logic [31:0] mult_s1;
logic [31:0] mult_s2;
logic [31:0] mult_s3;
logic [31:0] mult_s4;
always_ff @(posedge clk) begin
if (reset) begin
mult_s1 <= 32'd0;
mult_s2 <= 32'd0;
mult_s3 <= 32'd0;
end else begin
mult_s1 <= in0 * in1;
mult_s2 <= mult_s1;
mult_s3 <= mult_s2;
mult_s4 <= mult_s3;
end
end
assign out = mult_s4;
Let's step through this block of code. This multiplier has 4 pipeline stages. In the first stage, we compute the multiplication
using *
. mult_s1
, mult_s2
, mult_s3
, and mult_s4
each represent the pipeline registers we use to store the result of each stage.
The always_ff
block specifies what assignments we should make at the start of each clock cycle. If reset is asserted, we should
set all of our registers to 0. Otherwise, we can forward the computation along from the previous register.
The fundamental problem with our original ALU was the fact that we expected the results from the adder and the multiplier to be ready at the same time. Let's go back and fix that:
logic [31:0] add_s1;
logic [31:0] add_s2;
logic [31:0] add_s3;
logic [31:0] add_result;
always_ff @(posedge clk) begin
if (reset) begin
add_s1 <= 0;
add_s2 <= 0;
add_s3 <= 0;
add_result <= 0;
end else begin
add_s1 <= in0 + in1;
add_s2 <= add_s1;
add_s3 <= add_s2;
add_result <= add_s3;
end
end
This block of code computes the addition and then saves the result until the multiplication is finished. Let's try testing again.
./alu_tb
cycle 0: in0 = 00000002, in1 = 00000004, opsel = 1
cycle 0: out = xxxxxxxx
cycle 1: in0 = 00000002, in1 = 00000004, opsel = 1
cycle 1: out = xxxxxxxx
cycle 2: in0 = 00000002, in1 = 00000004, opsel = 1
cycle 2: out = xxxxxxxx
cycle 3: in0 = 00000002, in1 = 00000004, opsel = 1
cycle 3: out = 00000006
cycle 4: reset = 1
cycle 5: in0 = 00000002, in1 = 00000004, opsel = 0
cycle 5: out = 00000000
cycle 6: in0 = xxxxxxxx, in1 = xxxxxxxx, opsel = 0
cycle 6: out = 00000000
cycle 7: in0 = xxxxxxxx, in1 = xxxxxxxx, opsel = 0
cycle 7: out = 00000000
cycle 8: in0 = xxxxxxxx, in1 = xxxxxxxx, opsel = 0
cycle 8: out = 00000008
alu_tb.v:70: $finish called at 101 (1s)
On cycle 0, we send the ALU our addition. It doesn't finish until cycle 3, where we see the correct output. On cycle 5,
we send the ALU our multiplication. It doesn't finish until cycle 8, where we see the correct output. If you look closely
at the above trace, you'll see that when we send the multiplication, we don't have to assert in0
and in1
for all 4 cycles;
looking at our implementation, you'll notice that's because we only read from in0
and in1
in cycle 0. You'll notice something
similar for opsel
; we only read from it in the very last cycle of our computation.
Now, that was a lot of information about our circuit. Information which is critical to our circuit functioning properly, especially if it is interacting with other circuits. And, as you definitely noticed, none of it was explicit in our Verilog code. We had to carefully extract all of it and try real hard not to forget any of it. What if there was a better way?
The better way
Now, we can finally understand where Filament comes in. This is what our ALU's signature would look like in Filament:
comp ALU<'G:1>(
go: interface['G],
in0: ['G, 'G+1] 32,
in1: ['G, 'G+1] 32,
opsel: ['G+3, 'G+4] 1
) -> (
out: ['G+3, 'G+4] 32
)
This is necessarily more verbose than a Verilog signature. Let's break it down. comp ALU
means this is a component named ALU;
simple enough. <'G:1>
means that there is an event 'G
associated with this component that has a delay of 1; more on that later.
go: interface['G]
means we can think of the event 'G
as signaling the "start" of the computation. Filament's events correspond
directly to clock cycles, so 'G
is really a clock cycle. Everything else inside of ALU
happens in relation to 'G
.
in0: ['G, 'G+1] 32
means in0
is a port of width 32, and it has the availability interval of ['G, 'G+1]
. This idea corresponds
directly to the observation we just made about our ALU only reading from in0
and in1
in the first cycle of its computation. We've
encoded this directly into the language, and soon we'll see the cool things we can do with that. Similar observations can be made about
in1
, opsel
, and out
.
Now, let's implement our ALU and see what happens if we make the same mistake as before. We have access to Filament's primitive library, which includes components like adders, multipliers, and muxes. Here is the signature for an adder:
comp Add[W]<'G: 1>(
right: ['G, 'G+1] W,
left: ['G, 'G+1] W,
) -> (
out: ['G, 'G+1] W
) W > 0
Quick note: Filament components can take parameters, which go in the brackets directly after the component name. Here,
W
is a parameter we use to represent the width of our operands.
We can tell this component is combinational because its outputs are produced in the same cycle that its inputs are provided.
Here's the multiplier:
comp FastMult[W]<'G: 1>(
left: ['G, 'G+1] W,
right: ['G, 'G+1] W,
) -> (
out: ['G+3, 'G+4] W,
) where W > 0
Again, now we know the precise timing behavior of our multiplier. What happens if we make a mistake similar to the one we made in our Verilog implementation? We'll try it inside our Filament ALU.
add := new Add[32]<'G>(in0, in1);
mult := new FastMult[32]<'G>(in0, in1);
mux := new Mux[32]<'G+3>(opsel, add.out, mult.out);
out = mux.out;
In the first three lines, we are creating an instance of a component and then invoking it at the specified time with
the given inputs. In the case of the adder, we instantiate the Add
component at time
'G
with inputs in0
and in1
. We bind
this invocation to the name `add
, so we can refer to its output ports later, like when we pass it into the mux. When we try to
compile this code, we get the following error:
error: source port does not provide value for as long as destination requires
┌─ ../../proj/filament-blog1/alu.fil:14:37
│
14 │ mux := new Mux[32]<'G+3>(opsel, add.out, mult.out);
│ ^^^^^^^
│ │
│ source is available for ['G, 'G+1]
│ required for ['G+3, 'G+4]
Compilation failed with 1 errors.
add
is available at 'G
, but
the mux is expecting its inputs at 'G+3
because we invoked it at `'G+3
to account for the multiplier's delay. Now that we know
the problem, the fix is simple: extend the availability of add.out
using a register.
add := new Add[32]<'G>(in0, in1);
add_reg := new Register[32]<'G, 'G+4>(add.out);
mult := new FastMult[32]<'G>(in0, in1);
mux := new Mux[32]<'G+3>(opsel, add_reg.out, mult.out);
out = mux.out;
Now, we get a different error from the compiler.
error: event provided to invocation triggers more often that invocation's event's delay allows
┌─ ../../proj/filament-blog1/alu.fil:14:33
│
5 │ comp ALU<'G:1>(
│ - this event triggers every 1 cycles
·
14 │ add_reg := new Register[32]<'G, 'G+4>(add.out);
│ ^^ event provided to invoke triggers too often
│
┌─ ./primitives/state.fil:6:29
│
6 │ comp Register[WIDTH]<'G: 'L-('G+1), 'L: 1>(
│ --------- invocation's event is allowed to trigger every 3 cycles
Compilation failed with 1 errors.
'G
. In our signature, we specified
that 'G
has a delay of 1, which means that ALU can process a new input every cycle. If we look more closely at the error, we
see that it is telling us that because we use the register over the interval 'G, 'G+4
, we can't possibly handle new inputs. It is most
clear with an example: if we get a set of inputs at cycle 0, that computation will be using the register over cycles 0 to 4. If we then
get a set of inputs at cycle 1, that computation will need to use the register over cycles 1 to 5. This overlap causes a problem, since
we only have a single physical register. There are actually two fixes: we can either alter our signature to reflect this constraint by
changing the delay of 'G
to 3, or we can alter our design to achieve a delay of 1. Let's explore the second option:
add := new Add[32]<'G>(in0, in1);
r0 := new Register[32]<'G, 'G+2>(add.out);
r1 := new Register[32]<'G+1, 'G+3>(r0.out);
r2 := new Register[32]<'G+2, 'G+4>(r1.out);
mult := new FastMult[32]<'G>(in0, in1);
mux := new Mux[32]<'G+3>(opsel, r2.out, mult.out);
out = mux.out;
We specify our inputs to the Filament design with a JSON:
{
"opsel": [1, 0, 1, 0],
"in0": [3, 4, 5, 6],
"in1": [8, 9, 2, 3]
}
opsel
with a value of 1 on cycle 0, a value of 0 on cycle 1, and so on. Our output looks like this:
{"out": {"0": [11], "1": [36], "2": [7], "3": [18]}, "cycles": 7}
With Filament, we were able to express the fundamental constraints on our circuit through language-level constructs. This allowed the compiler to automate much of the reasoning that we had to do ourself when we were using Verilog. One important limitation of Filament is that it can only express statically-scheduled circuits, which are circuits whose timing behavior is input-independent. In our example, our multiplier always took 4 cycles. However, you can imagine a multiplier that, for example, uses a left shift to compute its output if it detects that one of its operands is a power of two, for example. This is a dynamic circuit, which cannot currently be expressed in Filament.
Still, this post has only scratched the surface of what is possible in Filament. We'll get into the rest of it in future posts.